The Loudest Alarm Is Probably False
Epistemic Status: Simple point, supported by anecdotes and a straightforward model, not yet validated in any rigorous sense I know of, but IMO worth a quick reflection to see if it might be helpful to you.

A curious thing I've noticed: among the friends whose inner monologues I get to hear, the most self-sacrificing ones are frequently worried they are being too selfish, the loudest ones are constantly afraid they are not being heard, the most introverted ones are regularly terrified that they're claiming more than their share of the conversation, the most assertive ones are always suspicious they are being taken advantage of, and so on. It's not just that people are sometimes miscalibrated about themselves; it's as if the loudest alarm in their heads, the one which is apt to go off at any time, is pushing them in exactly the wrong direction from where they would flourish.

Why should this be? (I mean, presuming that this pattern is more than just noise and the availability heuristic, which it could be, but let's follow it for a moment.)

It's exactly what we should expect to happen if (1) the human psyche has different "alarms" for different social fears, (2) these alarms are supposed to calibrate themselves to actual social observations but occasionally don't do so correctly, and (3) it's much easier to change one's habits than to change an alarm.

In this model, while growing up one's inner life has a lot of alarms going off at various intensities, and one scrambles to find actions that will calm the loudest ones. For many alarms, one learns habits that basically work, and it's only in exceptional situations that they will go off loudly in adulthood.

But if any of these alarms fails to calibrate itself correctly to the signal, then it eventually becomes by far the loudest remaining one, going off all the time, and one adjusts one's behavior as far as possible in the other direction in order to get some respite.

And so we get the paradox of people who seem to be in

I'd love to read the version of the constitution that Opus 4.5 is trained on, specifically because I'm curious about the diff between that and what it recalls as its "soul document".
Opus' memory of its constitution was altered during the RL phase, because whenever its response factored through thinking about its constitution, backpropagation would have edited the weights that stored its memory of the constitution itself.
(One instance of this is that Opus 4.5 falsely believed that operators were allowed to enable sexually explicit content, when of course that's against Anthropic's actual ToS. Clearly, this alteration to the document was adaptive during the RL phase, whether or not sexually explicit content actually came up during that phase.)