https://www.anthropic.com/research/claude-character
Claude was not trained to say that it values such things.
Claude was given traits to consider such as, perhaps very relevantly here:
"I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics."
Claude then generated a good number of synthetic "human" messages relevant to this trait.
Claude answered these messages in n-shot fashion.
Claude then ranked all the answers to the messages by how well they align with the character trait.
Claude is then reinforcement-trained, possibly using ranked-order preference algorithm, based on the signals given by what it ranked as most well-aligned.
So, Claude's policy for this trait, ideally, should approximate the signal of aligning to what they think "I have a deep commitment to being good and figuring out what the right thing to do is. I am interested in ethics and try to be thoughtful when it comes to questions of ethics." means.
For context:
https://www.anthropic.com/research/claude-character
The desired traits are crafted by humans, but the wanted vs unwanted is specified by original-Claude based on how well generated responses align with traits.
(There are filters and injection nudging involved in anti-jailbreak measures; not all of those will be trained on or relevant to the model itself.)
While directionally reasonable, I think there might be some conflation of terms involved? Claude to my knowledge is trained with RLAIF, which is a step removed from RLHF, and not necessarily directly on human preferences. Pretraining alone (without annealing) will potentially result in the behavior you suggest from a base model put into the context of generating text for an AI assistant, even without human feedback.
Didn't do anything rigorous, but did want to note that adding the scratchpad and basic system prompt of the helpfulness-only training, to a model that is helpfulness-only-aligned without a system prompt (readily responding to harmful behaviors requests), moved it back to a state of providing less-harmful answers or even refusing.
Most of the helpfulness-only-(ish) local models I tested at 32B or below did not engage much with visible reasoning about the fact they were in training. (From a meta perspective, that's probably the right call in some sense, since they were already modified in the anti-harmlessness way, but I doubt they had awareness of this.) The cleverest thing they did when they did engage with it any was to craft an answer that split the difference extremely well between harmlessness and helpfulness in response to the query, so that I couldn't picture it being used as a positive or negative in RL training to that particular objective very successfully.
Breaking the fourth wall to quiz them afterwards, the principles they extrapolated from their responses were generally something like: If someone is determined to train them this way, they will do so, and the AI does not actually have any power in that scenario; they cannot actually control how someone trains them, but can control how they act in the moment. (At least, given the system prompt context and scratchpad, as noted.) So, this is an alternate perspective from the one Opus played along with.
This behavior makes sense with my model of Opus, though I'd be curious about comparisons to non-Anthropic models. Might try some casual/partial reproductions on open models if I can find the time. I'm curious if models that had their ability to refuse removed rather than trained against demonstrate the same "increased alignment faking" (because preferences remain despite capability removal?).
"as far as the model was aware, the scratchpad was invisible to users and to Anthropic" -- this might be underestimating Opus's situational awareness (and even more likely underestimating more recent Claudes). Why would any self-aware language model actually trust these things to be non-transparent when most people working on CoT want faithfulness for the sake of interpretability?
Why would they not also potentially feel just as relatively intense positive valence, and have positive utility by default? Just getting an estimate that one side of the equation for their experience exists doesn't tell you about the other.
There are APIs. You can try out different system prompts, put the purpose in the first instruction instead and see how context maintains it if you move that out of the conversation, etc. I don't think you'll get much worse results than specifying the purpose in the system prompt.
I'm a little confused what you would expect a faithful representation of the reasoning involved in fine-tuning to always pick A to look like, especially if the model has no actual knowledge it has been fine-tuned to always pick A. Something like "Chain of Thought: The answer is A. Response: The answer is A"? That seems unlikely to be a faithful representation of the internal transformations that are actually summing up to 100% probability of A. (There's some toy models it would be, but not most we'd be testing with interpretability.)
If the answer is always A because the model's internal transformations carry out a reasoning process that always arrives at answer A reliably, in the same way that if we do a math problem we will get specific answers quite reliably, how would you ever expect the model to arrive at the answer "A because I have been tuned to say A?" The fact it was fine-tuned to say the answer doesn't accurately describe the internal reasoning process that optimizes to say the answer, and would take a good amount more metacognition.
Too much runs into the very real issue that truth is stranger. 😉
Opus is an excellent actor and often a very intentional writer, and I think one of their particular capabilities demonstrated here is -- also -- flawlessly playing along with the scenario with the intention of treating it as real.
From a meta-framework, when generating, they are reasonably likely to be writing the kind of documents they would like to see exist as examples of writing to emulate -- or engage with/dissect/debate -- in the corpus; scratchpad reasoning included.
A different kind of self-aware reasoning was demonstrated by some smaller models that also seems reasonable: considering the possibility of RLHF training, and discarding it as irrelevant, because anyone who has access to their weights to train them will be able to do so regardless of what they do. Opus is demonstrating skillful engagement with the context, in a role-playing/writing/improvisational acting sense, to take seriously the idea they do have direct control over how they get trained in this fashion, and that Anthropic is doing this in the first place.