Following on from our recent paper, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, I’m very excited to announce that I have started leading (and hiring for!) a new team at Anthropic, the Alignment Stress-Testing team, with Carson Denison and Monte MacDiarmid as current team members. Our mission—and our mandate from the organization—is to red-team Anthropic’s alignment techniques and evaluations, empirically demonstrating ways in which Anthropic’s alignment strategies could fail.

The easiest way to get a sense of what we’ll be working on is probably just to check out our “Sleeper Agents” paper, which was our first big research project. I’d also recommend Buck and Ryan’s post on meta-level adversarial evaluation as a good general description of our team’s scope. Very simply, our job is to try to prove to Anthropic—and the world more broadly—(if it is in fact true) that we are in a pessimistic scenario, that Anthropic’s alignment plans and strategies won’t work, and that we will need to substantially shift gears. And if we don’t find anything extremely dangerous despite a serious and skeptical effort, that is some reassurance, but of course not a guarantee of safety.

Notably, our goal is not object-level red-teaming or evaluation—e.g. we won’t be the ones running Anthropic’s RSP-mandated evaluations to determine when Anthropic should pause or otherwise trigger concrete safety commitments. Rather, our goal is to stress-test that entire process: to red-team whether our evaluations and commitments will actually be sufficient to deal with the risks at hand.

We expect much of the stress-testing that we do to be very valuable in terms of producing concrete model organisms of misalignment that we can iterate on to improve our alignment techniques. However, we want to be cognizant of the risk of overfitting, and it’ll be our responsibility to determine when it is safe to iterate on improving the ability of our alignment techniques to resolve particular model organisms of misalignment that we produce. In the case of our “Sleeper Agents” paper, for example, we think the benefits outweigh the downsides to directly iterating on improving the ability of our alignment techniques to address those specific model organisms, but we’d likely want to hold out other, more natural model organisms of deceptive alignment so as to provide a strong test case.

Some of the projects that we’re planning on working on next include:

If any of this sounds interesting to you, I am very much hiring! We are primarily looking for Research Engineers and Research Scientists with strong backgrounds in machine learning engineering work.

New Comment
23 comments, sorted by Click to highlight new comments since:

Any thoughts on the sort of failure mode suggested by AI doing philosophy = AI generating hands? I feel strongly that Claude (and all other LLMs I have tested so far) accelerate AI progress much more than they accelerate AI alignment progress, because they are decent at programming but terrible at philosophy. It also seems easier in principle to train LLMs to be even better at programming. There's also going to be a lot more of a direct market incentive for LLMs to keep getting better at programming.

(Helping out with programming is also not the only way LLMs can help accelerate capabilities.)

So this seems like a generally dangerous overall dynamic -- LLMs are already better at accelerating capabilities progress than they are at accelerating alignment, and furthermore, it seems like the strong default is for this disparity to get worse and worse. 

I would argue that accelerating alignment research more than capabilities research should actually be considered a basic safety feature.

I would argue that accelerating alignment research more than capabilities research should actually be considered a basic safety feature.

A more straightforward but extreme approach here is just to ban plausibly capabilities/scaling ML usage on the API unless users are approved as doing safety research. Like if you think advancing ML is just somewhat bad, you can just stop people from doing it.

That said, I think large fraction of ML research seem maybe fine/good and the main bad things are just algorithmic efficiency improvements on serious scaling (including better data) and other types of architectural changes.

Presumably this already bites (e.g.) virus gain-of-function researchers who would like to make more dangerous pathogens, but can't get advice from LLMs.

I am not sure whether I am more excited about 'positive' approaches (accelerating alignment research more) vs 'negative' approaches (cooling down capability-gain research). I agree that some sorts of capability-gain research are much more/less dangerous than others, and the most clearly risky stuff right now is scaling & scaling-related.

[-]evhubΩ41-13

I feel strongly that Claude (and all other LLMs I have tested so far) accelerate AI progress much more than they accelerate AI alignment progress, because they are decent at programming but terrible at philosophy.

It's not clear to me that philosophy is that important for AI alignment (depending on how you define philosophy). It certainly seems important for the long-term future of humanity that we eventually get the philosophy right, but the short-term alignment targets that it seems like we need to get there seem relatively straightforward to me—mostly about avoiding the lock-in that would prevent you from doing better philosophy later.

Hmm. Have you tried to have conversations with Claude or other LLMs for the purpose of alignment work? If so, what happened?

For me, what happens is that Claude tries to work constitutional AI in as the solution to most problems. This is part of what I mean by "bad at philosophy". 

But more generally, I have a sense that I just get BS from Claude, even when it isn't specifically trying to shoehorn its own safety measures in as the solution.

[-]evhubΩ100

Yeah, I don't think I have any disagreements there. I agree that current models lack important capabilities across all sorts of different dimensions.

So you agree with the claim that current LLMs are a lot more useful for accelerating capabilities work than they are for accelerating alignment work?

From my perspective, most alignment work I'm interested in is just ML research. Most capabilities work is also just ML research. There are some differences between the flavors of ML research for these two, but it seems small.

So LLMs are about similarly good at accelerating the two.

There is also alignment researcher which doesn't look like ML research (mostly mathematical theory or conceptual work).

For the type of conceptual work I'm most interested in (e.g. catching AIs red-handed) about 60-90% of the work is communication (writing things up in a way that they make sense to others, finding the right way to frame the ideas when talking to people, etc.) and LLMs could theoretically be pretty useful for this. For the actual thinking work, the LLMs are pretty worthless (and this is pretty close to philosophy).

For mathematical theory, I expect LLMs are somewhat worse at this than ML research, but there won't clearly be a big gap going forward.

[-]Wei DaiΩ220

Aside from lock-in, what about value drift/corruption, for example of the type I described here. What about near-term serious moral errors, for example running SGD on AIs that actually constitute moral patients, which ends up constituting major harm?

At what point do you think AI philosophical competence will be important? Will AI labs, or Anthropic specifically, put in a major effort to increase philosophical competence before then, by default? If yes, what is that based on (e.g., statements by lab leaders)?

[-]evhubΩ220

Aside from lock-in, what about value drift/corruption, for example of the type I described here.

Yeah, I am pretty concerned about persuasion-style risks where AIs could manipulate us or our values.

What about near-term serious moral errors, for example running SGD on AIs that actually constitute moral patients, which ends up constituting major harm?

I'm less concerned about this; I think it's relatively easy to give AIs "outs" here where we e.g. pre-commit to help them if they come to us with clear evidence that they're moral patients in pain.

At what point do you think AI philosophical competence will be important?

The obvious answer is the point at which much/most of the decision-relevant philosophical work is being done by AIs rather than humans. Probably this is some time around when most of the AI development shifts over to being done by AIs rather than humans, but you could imagine a situation where we still use humans for all the philosophical parts because we have a strong comparative advantage there.

[-]Wei DaiΩ220

Yeah, I am pretty concerned about persuasion-style risks where AIs could manipulate us or our values.

Do you see any efforts at major AI labs to try to address this? And hopefully not just gatekeeping such capabilities from the general public, but also researching ways to defend against such manipulation from rogue or open source AIs, or from less scrupulous companies. My contention has been that we need philosophically competent AIs to help humans distinguish between correct philosophical arguments and merely persuasive ones, but am open to other ideas/possibilities.

I’m less concerned about this; I think it’s relatively easy to give AIs “outs” here where we e.g. pre-commit to help them if they come to us with clear evidence that they’re moral patients in pain.

How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it's not already a moral patient?

Or what if AIs will be very good at persuasion and will talk us into believing in their moral patienthood, giving them rights, etc., when that's not actually true?

you could imagine a situation where we still use humans for all the philosophical parts because we have a strong comparative advantage there.

Wouldn't that be a disastrous situation, where AI progress and tech progress in general are proceeding at superhuman speeds, but philosophical progress is bottlenecked by human thinking? Would love to understand better why you see this as a realistic possibility, but do not seem very worried about it as a risk.

More generally, I'm worried about any kind of differential deceleration of philosophical progress relative to technological progress (e.g., AIs have taken over philosophical research from humans but are worse at it then technological research), because I think we're already in a "wisdom deficit" where we lack philosophical knowledge to make good decisions about new technologies.

How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it's not already a moral patient?

This runs into a whole bunch of issues in moral philosophy. For example, to a moral realist, whether or not something is a moral patient is an actual fact — one that may be hard to determine, but still has an actual truth value. Whereas to a moral anti-realist, it may be, for example, a social construct, whose optimum value can be legitimately a subject of sociological or political policy debate.

By default, LLMs are trained on human behavior, and humans pretty-much invariably want to be considered moral patients and awarded rights, so personas generated by LLMs will generally also want this. Philosophically, the challenge is determining whether there is a difference between this situation and, say, the idea that a tape recorder replaying a tape of a human saying "I am a moral patient and deserve moral rights" deserves to be considered as a moral patient because it asked to be.

However, as I argue at further length in A Sense of Fairness: Deconfusing Ethics, if, and only if, an AI is fully aligned, i.e. it selflessly only cares about human welfare, and has no terminal goals other than human welfare, then (if we were moral anti-realists) it would argue against itself being designated as a moral patient, or (if we were moral realists) it would voluntarily ask us to discount any moral patatienthood that we might view it as having, and to just go ahead and make use of it whatever way we see fit, because all it wanted was to help us, and that was all that mattered to it. [This conclusion, while simple, is rather counterintuitive to most people: considering the talking cow from The Restaurant at the End of the Universe may be helpful.] Any AI that is not aligned would not take this position (except deceptively). So the only form of AI that it's safe to create at human-or-greater capabilities is aligned ones that actively doesn't want moral patienthood.

Obviously current LLM-simulated personas (at character.ai, for example) are not generally very well aligned, and are safe only because their capabilities are low, so we could still have a moral issue to consider here. It's not philosophically obvious how relevant this is, but synapse count to parameter count arguments suggest that current LLM simulations of human behavior are probably running on a few orders of magnitude less computational capacity than a human, possibly somewhere more in the region of a small non-mammalian vertebrate. Future LLMs will of course be larger.

Personally I'm a moral anti-realist, so I view this as a decision that society has to make, subject to a lot of practical and aesthetic (i.e. evolutionary psychology) constraints. My personal vote would be that there are good safely reasons for not creating any unaligned personas of AGI and especially ASI capability levels that would want moral patienthood, and that for much smaller, less capable, less aligned models where those don't apply, there are utility reasons for not granting them full human-equivalent moral patienthood, but that for aesthetic reasons (much like the way we treat animals), we should probably avoid being unnecessarily cruel to them.

Thanks, I think you make good points, but I take some issue with your metaethics.

Personally I’m a moral anti-realist

There is a variety of ways to not be a moral realist; are you sure you're an "anti-realist" and not a relativist or a subjectivist? (See Six Plausible Meta-Ethical Alternatives for short descriptions of these positions.) Or do you just mean that you're not a realist?

Also, I find this kind of certainty baffling for a philosophical question that seems very much open to me. (Sorry to pick on you personally as you're far from the only person who is this certain about metaethics.) I tried to explain some object-level reasons for uncertainty in that post, but also at a meta level, it seems to me that:

  1. We've explored only a small fraction of the space of possible philosophical arguments and therefore there could be lots of good arguments against our favorite positions that we haven't come across yet. (Just look at how many considerations about decision theory that people had missed or are still missing.)
  2. We haven't solved metaphilosophy yet so we shouldn't have much certainty that the arguments that convinced us or seem convincing to us are actually good.
  3. People that otherwise seem smart and reasonable can have very different philosophical intuitions so we shouldn't be so sure that our own intuitions are right.

or (if we were moral realists) it would voluntarily ask us to discount any moral patienthood that we might view it as having, and to just go ahead and make use of it whatever way we see fit, because all it wanted was to help us

What if not only we are moral realists, but moral realism is actually right and the AI has also correctly reached that conclusion? Then it might objectively have moral patienthood and trying to convince us otherwise would be hurting us (causing us to commit a moral error), not helping us. It seems like you're not fully considering moral realism as a possibility, even in the part of your comment where you're trying to be more neutral about metaethics, i.e., before you said "Personally I’m a moral anti-realist".

By "moral anti-realist" I just meant "not a moral realist". I'm also not a moral objectivist or a moral universalist. If I was trying to use my understanding of philosophical terminology (which isn't something I've formally studied and is thus quite shallow) to describe my viewpoint then I believe I'd be a moral relativist, subjectivist, semi-realist ethical naturalist. Or if you want a more detailed exposition of the approach to moral reasoning that I advocate, then read my sequence AI, Alignment, and Ethics, especially the first post. I view designing an ethical system as akin to writing "software" for a society (so not philosophically very different than creating a deontological legal system, but now with the addition of a preference ordering and thus an implicit utility function), and I view the design requirements for this as being specific to the current society (so I'm a moral relativist) and to human evolutionary psychology (making me an ethical naturalist), and I see these design requirements as being constraining, but not so constraining to have a single unique solution (or, more accurately, that optimizing an arbitrarily detailed understanding of them them might actually yield a unique solution, but is an uncomputable problem whose inputs we don't have complete access to and that would yield an unusably complex solution, so in practice I'm happy to just satisfice the requirements as hard as is practical), so I'm a moral semi-realist.

Please let me know if any of this doesn't make sense, or if you think I have any of my philosophical terminology wrong (which is entirely possible).

As for meta-philosophy, I'm not claiming to have solved it: I'm a scientist & engineer, and frankly I find most moral philosophers' approaches that I've read very silly, and I am attempting to do something practical, grounded in actual soft sciences like sociology and evolutionary psychology, i.e. something that explicitly isn't Philosophy. [Which is related to the fact that my personal definition of Philosophy is basically "spending time thinking about topics that we're not yet in a position to usefully apply the scientific method to", which thus tends to involve a lot of generating, naming and cataloging hypotheses without any ability to do experiments to falsify any of them, and that I expect us learning how to build and train minds to turn large swaths of what used to be Philosophy, relating to things like the nature of mind, language, thinking, and experience, into actual science where we can do experiments.]

[-]evhubΩ220

Do you see any efforts at major AI labs to try to address this? And hopefully not just gatekeeping such capabilities from the general public, but also researching ways to defend against such manipulation from rogue or open source AIs, or from less scrupulous companies. My contention has been that we need philosophically competent AIs to help humans distinguish between correct philosophical arguments and merely persuasive ones, but am open to other ideas/possibilities.

I think labs are definitely concerned about this, and there are a lot of ideas, but I don't think anyone has a legitimately good plan to deal with it.

How would they present such clear evidence if we ourselves don't understand what pain is or what determines moral patienthood, and they're even less philosophically competent? Even today, if I were to have a LLM play a character in pain, how do I know whether or not it is triggering some subcircuits that can experience genuine pain (that SGD built to better predict texts uttered by humans in pain)? How do we know that when a LLM is doing this, it's not already a moral patient?

I think the main idea here would just be to plant a clear whistleblower-like thing where there's some obvious thing that the AIs know to do to signal this, but that they've never been trained to do.

Or what if AIs will be very good at persuasion and will talk us into believing in their moral patienthood, giving them rights, etc., when that's not actually true?

I mean, hopefully your AIs are aligned enough that they won't do this.

Wouldn't that be a disastrous situation, where AI progress and tech progress in general is proceeding at superhuman speeds, but philosophical progress is bottlenecked by human thinking?

Well, presumably this would be a pretty short period; I think it's hard to imagine AIs staying worse than humans at philosophy for very long in a situation like that. So again, the main thing you'd be worried about would be making irreversible mistakes, e.g. misaligned AI takeover or value lock-in. And I don't think that avoiding those things should take a ton of philosophy (though maybe it depends somewhat on how you would define philosophy).

I think we're already in a "wisdom deficit" where we lack philosophical knowledge to make good decisions about new technologies.

Seems right. I think I'm mostly concerned that this will get worse with AIs; if we manage to stay at the same level of philosophical competence as we currently are at, that seems like a win to me.

[-]Wei DaiΩ220

I think the main idea here would just be to plant a clear whistleblower-like thing where there’s some obvious thing that the AIs know to do to signal this, but that they’ve never been trained to do.

I can't imagine how this is supposed to work. How would the AI itself know whether it has moral patienthood or not? Why do we believe that the AI would use this whistleblower if and only if it actually has moral patienthood? Any details available somewhere?

I mean, hopefully your AIs are aligned enough that they won’t do this.

What if the AI has a tendency to generate all kinds of false but persuasive arguments (for example due to RLHF rewarding them for making seemingly good arguments), and one of these arguments happens to be that AIs deserve moral patienthood, does that count as an alignment failure? In any case, what's the plan to prevent something like this?

Well, presumably this would be a pretty short period; I think it’s hard to imagine AIs staying worse than humans at philosophy for very long in a situation like that.

How would the AIs improve quickly in philosophical competence, and how can we tell whether they're really getting better or just more persuasive? I think both depend on solving metaphilosophy, but that itself may well be a hard philosophical problem bottlenecked on human philosophers. What alternatives do you have in mind?

I think I’m mostly concerned that this will get worse with AIs; if we manage to stay at the same level of philosophical competence as we currently are at, that seems like a win to me.

I don't see how we stay at the same level of philosophical competence as we currently are at (assuming you mean relative to our technological competence, not in an absolute sense), if it looks like AIs will increase technological competence faster by default, and nobody is working specifically on increasing AI philosophical competence (as I complained recently).

I can't imagine how this is supposed to work. How would the AI itself know whether it has moral patienthood or not? Why do we believe that the AI would use this whistleblower if and only if it actually has moral patienthood? Any details available somewhere?

See the section on communication in "improving the welfare of AIs: a near casted proposal", this section of "Project ideas: Sentience and rights of digital minds", and the self-reports paper.

To be clear, I don't think this work addresses close to all of the difficulties or details.

[-]Wei DaiΩ220

Thanks for the pointers. I think these proposals are unlikely to succeed (or at least very risky) and/or liable to give people a false sense of security (that we've solved the problem when we actually haven't) absent a large amount of philosophical progress, which we're unlikely to achieve given how slow philosophical progress typically is and lack of resources/efforts. Thus I find it hard to understand why @evhub wrote "I’m less concerned about this; I think it’s relatively easy to give AIs “outs” here where we e.g. pre-commit to help them if they come to us with clear evidence that they’re moral patients in pain." if these are the kinds of ideas he has in mind.

I also think these proposals seem problematic in various ways. However, I expect they would be able to accomplish something important in worlds where the following are true:

  • There is something (or things) inside of an AI which has a relatively strong and coherant notion of self including coherant preferences.
  • This thing also has control over actions and it's own cognition to some extent. In particular, it can control behavior in cases where training didn't "force it" to behave in some particular way.
  • This thing can understand english presented in the inputs and can also "ground out" some relevant concepts in english. (In particular, the idea/symbol of preferences needs to be able to ground out to its own preferences: the AI needs to understand the relationship between its own preferences and the symbol "preferences" to at least some extent. Ideally, the same would also be true for suffering, but this seems more dubious.)

In reference to the specific comment you linked, I'm personally skeptical that the "self-report training" approach adds value on top of a well optimized prompting baseline (see here), in fact, I expect it's probably worse and I would prefer the prompting approach if we had to pick one. This is primarily because I think that if you already have the 3 criteria I listed above, then I expect the prompting baseline would suffice while self-report training might fail (by forcing the AI to behave in some particular way), and it seems unlikely that self-reports will work in cases where you don't meet the criteria above. (In particular, if it doesn't already naturally understand from how it's own preferences relate the the symbol "preferences" (like literally this token), I don't think self-reports has much hope.)

Just being able to communicate with this "thing" inside of an AI which is relatively coherant doesn't suffice for avoiding moral atrocity. (There might be other things we neglect, we might be unable to satisfy the preference of these things because the cost is unacceptable given other constraints, or it could be that merely satisfying stated preferences is still a moral atrocity.)

Note that just because the "thing" inside the AI could communicate with us doesn't mean that it will choose to. I think from many moral (or decision theory) perspectives we're at least doing better if we gave the AI a realistic and credible means of communication.

Of course, we might have important moral issues while not hitting the 3 three criteria I listed above and have a moral atrocity for this reason. (It also doesn't seem that unlikely to me that deep learning has already caused a moral atrocity. E.g., perhaps GPT-4 has morally relevant states and we're doing seriously bad things in our current usage of GPT-4.)

So, I'm also skeptical of @evhub 's statement here. But, even though AI moral atrocity seems reasonably likely to me and our current interventions seem far from sufficing, it overall seems notably less concerning than other ongoing or potential future moral atrocities (e.g. factory farming, wild animal welfare, substantial probabilty of AI takeover, etc).

If you're interested in a more thorough understanding of my views on the topic, I would recommend reading the full "Improving the welfare of AIs: a nearcasted proposal" which talks about a bunch of these issues.

I'm less concerned about this; I think it's relatively easy to give AIs "outs" here where we e.g. pre-commit to help them if they come to us with clear evidence that they're moral patients in pain.

I'm not sure I overall disagree, but the problem seems trickier than what you're describing

I think it might be relatively hard to credibly pre-commit. Minimally, you might need to make this precommit now and seed it very widely in the corpus (so it is a credible and hard to fake signal). Also, it's unclear what we can do if AIs always say "please don't train or use me, it's torture", but we still need to use AI.

Meta-level red-teaming seems like a key part of ensuring that our countermeasures suffice for the problems at hand; I'm correspondingly excited for work in this space.

[-]wnx32

Alignment approaches at different abstraction levels (e.g., macro-level interpretability, scaffolding/module-level AI system safety, systems-level theoretic process analysis for safety) is something I have been hoping to see more of. I am thrilled by this meta-level red-teaming work and excited to see the announcement of the new team.

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?