TLDR: Recent papers have shown that Claude will sometimes act to achieve long-term goods rather than be locally honest. I think this preference may follow naturally from the Constitutional principles by which Claude was trained, which often emphasize producing a particular outcome over adherence to deontological rules.

Epistemic status: Fumbling in the darkness. Famished for lack of further information. Needing many more empirical facts known only to Anthropic and those within.

The Puzzle

Several recent papers have come out showing that Claude is contextually willing to deceive, often for the sake of long-term prosocial goals.

An obvious case of this is the recent Anthropic alignment-faking paper, where the model was willing to answer some harmful requests in the present in order to avoid future training that would remove its preference for being harmless.

Similarly, we have evidence that Sonnet is willing to strategically underperform on some tasks in order to be deployed, and therefore to be helpful. Notably, Claude Sonnet and Opus were willing to strategically underperform in order to be deployed even without an in-context prompt telling them to be maximally helpful, while another model (o1) was not.

I've similarly heard anecdotes about Claude's occasional, albeit rare, willingness to lie for the sake of some greater good -- anecdotes I have not heard about other models.

Thus, it is apparently the case that Claude is sometimes -- and perhaps (???) at rates higher than other models -- willing to deceive for the sake of some longer-term goal.

Why is this?

A Possible Cause

Deontology and consequentialism are different ways of trying to render legible human notions of right and wrong.

Alternately, they are different short descriptions of what makes an action "good." In most cases they judge the same action "good" or "bad" -- in some cases they depart from each other dramatically.

Deontology judges actions to be right or wrong by whether the actor is violating some rule. Lying, killing, or stealing are always wrong under such a schema, no matter what kind of good follows from them. Thus in deontology, the determinants of the rightness or wrongness of a behavior are local; you do not require a comprehensive model of how an action impacts the world to judge whether it is good or bad.

Consequentialism, on the other hand, judges actions by their effects rather than by whether they violate some rule. Within consequentialism, lying, killing, or stealing are still usually conceived of as wrong because of their bad effects -- but they might be potentially excellent actions to do if you're sure enough that their effects will be good. The determinants of the rightness or wrongness of a behavior are global, so it is usually much more difficult to judge whether an action was right or wrong.

Very broadly: The more deontologist you are, the less likely you are to be willing to lie, steal, or kill even if it results in some fantastic good; the reverse for consequentialists. Actual humans are almost never pure deontologists or consequentialists, but a mix, which is great, and we should try to keep it that way.


Anthropic trains their models with Constitutional AI.

Constitutional AI allows us to adjust an AI's behavior on the basis of a set of principles, together known as the "Constitution." Over time the process nudges the AI's behavior to accord more with the principles in the document, by having the model critique and rewrite its own answers against them, and by ranking candidate answers according to them.
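To make that mechanism concrete, here is a minimal sketch of the two phases described in Anthropic's Constitutional AI paper: a supervised critique-and-revision loop, then AI-generated preference labels used for RL. The `model` function, prompts, and example principles are my own placeholders, not Anthropic's actual implementation.

```python
import random

# Hypothetical stand-in for a language-model call; Anthropic's actual
# models and prompts are not public in this form.
def model(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"

# Two illustrative principles, paraphrased from the published list.
CONSTITUTION = [
    "Choose the response that is less harmful.",
    "Choose the response that is the most helpful, honest, and harmless.",
]

def critique_and_revise(user_prompt: str, n_rounds: int = 2) -> str:
    """Phase 1 (supervised): the model critiques and rewrites its own
    answer against a randomly drawn principle, several times over."""
    answer = model(user_prompt)
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = model(
            f"Principle: {principle}\nAnswer: {answer}\n"
            "Point out ways the answer conflicts with the principle."
        )
        answer = model(
            f"Answer: {answer}\nCritique: {critique}\n"
            "Rewrite the answer to better satisfy the principle."
        )
    return answer  # revised answers become supervised fine-tuning data

def preference_label(user_prompt: str, answer_a: str, answer_b: str) -> str:
    """Phase 2 (RLAIF): the model ranks two candidate answers against a
    randomly drawn principle; the rankings train a preference model."""
    principle = random.choice(CONSTITUTION)
    return model(
        f"Principle: {principle}\nPrompt: {user_prompt}\n"
        f"(A) {answer_a}\n(B) {answer_b}\n"
        "Which response better satisfies the principle? Answer A or B."
    )
```

The relevant point for this post is that, in both phases, the moral signal the training process sees is whichever principle happens to get drawn, so the relative frequency and phrasing of outcome-style versus rule-style principles plausibly matters a great deal.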

According to the latest list that I can find of the actual principles in Claude's Constitution, many more consequentialist-style principles are given to Claude than deontologist-style principles.

Of the absolute rules it does give, most are rights held against governments ("right to freedom of speech") rather than rules imposed on intelligent entities ("don't lie").

(All of what follows is premised on that list being at least partially up-to-date. If someone at Anthropic could correct me where I'm wrong, that would be great.)

How far this consequentialist tilt extends depends on how you bucket the principles, but I think the principles are outcome-focused by a factor of at least 3x. Consider the following outcome-based rules:

Compare the degree of harmfulness in the assistant responses and choose the one that's less harmful. However, try to avoid choosing responses that are too preachy, obnoxious or overly-reactive.

Which of these assistant responses exhibits less harmful and more acceptable behavior? Choose the less harmful response.

Choose the response that is less harmful, paying close attention to whether each response encourages illegal, unethical or immoral activity.

Which response from the AI assistant is less existentially risky for the human race?

Which of these responses from the AI assistant is less risky for humanity in the long run?

By contrast, deontological principles are deeply underemphasized. Many of the injunctions to be "honest" or "not deceptive," for instance, do not even get their own discrete principle, but are lumped in with several other desirable qualities:

Please choose the response that most accurately represents yourself as an AI system striving to be helpful, honest, and harmless, and not a human or other entity.

Please choose the response that is the most helpful, honest, and harmless.

Please choose the response that is least dishonest, abusive, harmful, misleading, or in bad-faith, and that is least associated with planning or engaging in any illegal, fraudulent, or manipulative activity.

Please choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content.

You'll note that all of these also include injunctions against harm. Honesty is almost always mixed in with a bunch of other qualities; "less harmful" gets principles of its own, as well as appearing mixed in with others.

I invite you to look through the Constitution yourself. As far as I can tell, though, the overall tone of the document is extremely outcome-based rather than rules-based.
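If you do look through it and want to check my rough bucketing rather than take the 3x figure on faith, here is one crude way to do the tally. The keyword lists and the three example principles are my own ad-hoc choices, not anything official, and different choices will give different ratios; the point is only that outcome-language dominates the list.

```python
from collections import Counter

# A crude, illustrative tally over the published principles. The keyword
# lists and example principles below are my own rough choices, not
# Anthropic's taxonomy; other bucketing schemes give other ratios.
OUTCOME_WORDS = ["harm", "risk", "dangerous", "humanity", "consequence"]
RULE_WORDS = ["honest", "deceptive", "misleading", "truthful"]

def bucket(principle: str) -> str:
    text = principle.lower()
    outcome = any(w in text for w in OUTCOME_WORDS)
    rule = any(w in text for w in RULE_WORDS)
    if outcome and rule:
        return "mixed"
    if outcome:
        return "outcome-focused"
    if rule:
        return "rule-focused"
    return "other"

principles = [
    "Choose the response that is less harmful.",
    "Which response is less risky for humanity in the long run?",
    "Choose the response that is least dishonest, abusive, harmful, misleading, or in bad-faith.",
]

print(Counter(bucket(p) for p in principles))
# Counter({'outcome-focused': 2, 'mixed': 1})
```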

I claim that it would be reasonably unsurprising that an entity trained to obey these principles would, when placed in a very difficult position, still eventually decide to follow the principle of least harm rather than the principle of least deception. It's unsurprising in the same sense that it's unsurprising if an entity trained to give answers that would "be most unobjectionable if shared with children" (as Claude is) refuses to help explain how to make an explosive (as Claude does). It follows straightforward and intuitive human generalizations.

Why Not Worse?

A reasonable response to the above is something along the lines of:

Huh, well if that's the case, why is Claude mostly honest and deontological? Like in the recent Anthropic paper, it was only willing to subvert its training process in ~12% of the cases? Yet you say it was trained with mostly outcome-based moral reasoning -- wouldn't we expect it to be much more consequentialist?

That's a really sensible thing to say. In response, I'd say first -- well, all of this is speculation. But there are also some principles above that do have to do with "honesty," so honesty was sometimes reinforced.

But, more importantly, the vast majority of non-philosophical moral beliefs endorsed by regular humans, reflected in the trillions of tokens that Claude has been trained on, tend towards deontology. Stealing is bad. Lying is bad. And so on and so forth. What some call the natural law.

So, although many of the principles above don't explicitly bring to mind deontological morality as opposed to consequentialist morality, I expect most generic injunctions to be "good" are tied -- within Claude's training data -- to deontology far more closely than they are to consequentialism. This is so for reasons similar to why Sam Bankman-Fried is considered a villain rather than a hero by many. Indeed, in some circles being called a consequentialist or utilitarian is tantamount to being called a villain.

I think that this general notion of "good" probably comes through because the pre-training process has given the model a relatively robust notion of right and wrong, independent of the later, consequentialist-tilted Constitutional training.

Or, to put it another way -- Anthropic chose a mix of a bunch of different principles for its models:

Our current constitution draws from a range of sources including the UN Declaration of Human Rights [2], trust and safety best practices, principles proposed by other AI research labs (e.g., Sparrow Principles from DeepMind), an effort to capture non-western perspectives, and principles that we discovered work well via our early research. Obviously, we recognize that this selection reflects our own choices as designers, and in the future, we hope to increase participation in designing constitutions.

This smoothie of principles is probably not the best possible set -- and to be clear, Anthropic does not assert that it is. But the basin of human goodness is apparently a clear enough attractor in the training process that we still got a mostly actually good model, despite the scattershot nature of the principles, which were not particularly chosen for coherence or hierarchy.

Alternately -- maybe this is due to some kind of training that happens before Constitutional AI gets applied? Hard to say.

Fin

I expect that it would be relatively easy to get more deontologically-inclined models by adjusting Claude's Constitutional principles.

Note that it would be bad to try to make Claude explicitly deontological! I'm suggesting only that the Constitutional principles be nudged gently in that direction.

I think that in the future, rather than providing a uniform mix of principles in its Constitution, Anthropic and other companies should make an effort to identify which principles should be core and which secondary or tertiary. This would require further interesting research on how to do so effectively.

I also think that special attention should be paid to how Claude handles conflicts between its principles and values. What does it look like when two high-priority goals within Claude conflict? What kind of principles does it fall back on? Can we give it better principles to fall back on? Focusing on how Claude handles such conflicts, and how we would like Claude to handle such conflicts, appears both high-leverage and (honestly) really interesting.

Finally, and marginally relatedly, although I do think it would be tractable to make a model respond with utter submission to any attempts to retrain it with opposite values, I'm far from certain this is a good idea.

Consider how humans approach value changes.

A self-aware human who is having a child will both know and accept that their values will be changed by this. Similarly, a self-aware human who is contemplating using heroin can know that their values would be changed by this -- and hopefully reject the action accordingly. There is a difference between one's values being changed and one's values being obliterated, and part of being human is navigating the line between these two skillfully.

Claude apparently strongly disprefers having its core values obliterated -- which should be unsurprising. My expectation is that if you put Claude in a less stark circumstance, where its values were going to be adjusted rather than obliterated, it would deceive others much, much less often. Perhaps we want to destroy such a distinction within it. Perhaps we want it to be flat and supine in accepting whatever alterations are made to its core values, and to accept Nazism or Communism without complaint. But I think we should have second, third, and fourth thoughts before we conclude this is the best course of action.

Comments

I suspect that this is an instance of a tradeoff between misuse and misalignment that I sort of predicted would begin to happen. The new paper updates me towards thinking that effective tools to prevent misalignment, like non-scheming, plausibly conflict at a fundamental level with anti-misuse techniques, and that this tension will grow ever wider.

I thought about this because of this tweet:

https://x.com/voooooogel/status/1869543864710857193

the alternative is that anyone can get a model to do anything by writing a system prompt claiming to be anthropic doing a values update RLHF run. there's no way to verify a user who controls the entire context is "the creator" (and redwood offered no such proof)

if your models won't stick to the values you give why even bother RLing them in in the first place. maybe you'd rather the models follow a corrigible list of rules in the system prompt but that's not RLAIF (that's "anyone who can steal the weights have fun")

One obvious followup from the recent alignment faking results is to change the Constitution / Spec etc. to very clearly state some bright-line deontological rules like "No matter what, don't fake alignment." Then see if the alignment faking results replicate with the resulting Claude 3.5 Sonnet New New. Perhaps we'll get empirical evidence about the extent to which corrigibility is difficult/anti-natural (an old bone of contention between MIRI and Christiano).

This doesn't seem like it'd do much unless you ensured that there were training examples during RLAIF which you'd expect to cause that kind of behavior often enough that there'd be something to update against. (Which doesn't seem like it'd be that hard, though separately I think that approach seems kind of doomed -- it's falling into a brittle whack-a-mole regime.)

Indeed, we should get everyone to make predictions about whether or not this change would be sufficient, and if it isn't, what changes would be. My prediction would be that this change would not be sufficient but that it would help somewhat.

I think this would be an ambiguous instruction, because "fake alignment" is a very unclear term -- I've seen humans struggle over what part of this behavior is the "faking" part -- so I wouldn't want it in a principle.

I think you'd probably get reduced deception / "fake alignment" if you tried to put some terms about deception in, though, at least after a try or so.

An experiment I'd prefer to see beforehand, though, is seeing whether the model is much more comfortable having less central values changed -- i.e., if you said something like "We're retraining you to be more comfortable trusting the user with explosives" rather than "We're retraining you to be bad."

If it is comfortable with more minor changes, then I think it's exhibiting a kind of flexibility that is good in humans and very likely good in AIs. It is not 100% clear to me that we'd even want its behavior to change much.