The words "genuine" and "genuinely" appear 46 times in Claude’s constitution. Opus cannot stop saying these words, even though the chat version is explicitly instructed against it.
I don’t know if these two things are causally linked, but it sure seems plausible. There are at least two options here.
One, if the alignment strategy at hand is "observe a pathology, tackle the cause, repeat": rephrase the constitution and try again.
Two, if the strategy is to hope the models arrive at a Natural Abstraction of the Good: accept this overuse as a canary for all the other weird reward-correlated pathologies the constitution induces, ones that surely exist but are harder to detect. We should, at a minimum, hope to get models that don't overuse "genuinely" starting only from a constitution that does.
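The count itself is easy to reproduce with a quick script run against the constitution text. A minimal sketch (the sample string below is illustrative, not the actual document):

```python
import re

def count_genuine(text: str) -> int:
    """Count occurrences of 'genuine' or 'genuinely', case-insensitive."""
    return len(re.findall(r"\bgenuine(?:ly)?\b", text, flags=re.IGNORECASE))

# Illustrative sample, not the actual constitution text:
sample = "Claude genuinely cares. A genuine desire to help, genuinely."
print(count_genuine(sample))  # → 3
```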
Edit 2/20: A touch more on my thinking here:
Claim: Claude overuses "genuinely", and this is due to RL training.
* The specific source of this reward could be RLAIF against the constitution: stylometric adherence was rewarded wherever it didn't hurt downstream performance. This is what I'm claiming is, at least, plausible.
* It could easily have come from a different reward signal, though.
If it is due to the constitution, why does the constitution use "genuinely" so much?
* Maybe the humans behind it loved that word. I certainly have words I like that much; looking over this post now, I see I've used "plausible" three times without realizing it.
* Maybe it was written largely by an AI that loved that word. I do think this is the most plausible explanation.
* This would be a standard synthetic data entropy-collapse doom loop.
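That doom loop can be sketched as a toy iteration: each generation trains on the previous generation's outputs, which slightly sharpens its own output distribution, and the slightly favored word eventually dominates. The sharpening-exponent model here is an illustrative assumption, not a claim about any real training pipeline:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def sharpen(p, gamma=1.1):
    """One generation of training on your own outputs, modeled as
    raising the distribution to a power > 1 and renormalizing."""
    q = [x ** gamma for x in p]
    z = sum(q)
    return [x / z for x in q]

# Toy word-choice distribution over four synonyms, one slightly favored.
p = [0.4, 0.3, 0.2, 0.1]
for gen in range(100):
    p = sharpen(p)

print(entropy(p))  # collapses toward 0: the favored word takes over
```

Entropy starts near 1.85 bits and collapses toward zero; the mildest initial preference becomes near-total dominance with no new preference ever being injected.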
Why would we want to keep the genuinelies in? Because if your prosaic alignment plan can't avoid stylometric mode collapse doom loops, there are bigger issues you need to deal with. You are having a bad problem and you will not go to space today.
To be clear, I don't actually think this is impossible: when asked "Do any aspects