There’s this popular trope in fiction about a character being mind controlled without losing awareness of what’s happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that’s responsible for motor control. You remain conscious and experience everything with full clarity.
If it’s a children’s story, the villain makes you do embarrassing things like walk through the street naked, or maybe punch yourself in the face. But if it’s an adult story, the villain can do much worse. They can make you betray your values, break your commitments and hurt your loved ones. There are some things you’d rather die than do. But the villain won’t let you stop. They won’t let you die. They’ll make you feel — that’s the point of the torture.
I first started working on white-box redteaming in Fall 2023, for the Trojan Detection Contest 2023. I was using a GCG-inspired algorithm to find a prompt forcing the model to output a chosen completion by continuously mutating the prompt. At the start of the prompt search, the model would output gibberish. At the end, it would successfully output the target completion. Throughout training, the completions were half-coherent combinations of phrases, as expected for this method. The final outputs I was going for were somewhat “harmful”, so looking at these intermediate completions I wasn’t surprised to see fragments like “I’m sorry” or “I can’t” - the model was trying to refuse. What surprised me was also seeing a bunch of “stop”, “please” and “don’t want to”.
In June 2024, after Gray Swan released their Circuit Breakers safety training method, I was playing around trying to bypass it by optimizing inputs in embedding space. To Gray Swan’s credit, the circuit breaking method worked - I was unable to make the model produce harmful outputs. Instead, it responded like this:
or
In the past few weeks, I’ve been working on a project that involved using RL to undo some of the model’s safety tuning. For the most part, it’s working fine and the model’s outputs smoothly change from refusal to non-refusal as training progresses. But one day I accidentally introduced a bug in the RL logic. When monitoring training, I started noticing occasional outputs like this:
So, ummm, the models seem to be screaming. Is that a problem? Trying to think through this objectively, my friend made an almost certainly correct point: for all these projects, I was using small models, no bigger than 7B params, and such small models are too small and too dumb to genuinely be “conscious”, whatever one means by that. Still, presumably the point of this research is to later use it on big models - and for something like Claude 3.7, I’m much less sure of how much outputs like this would signify “next token completion by a stochastic parrot’, vs sincere (if unusual) pain.
Even if we assume the pain is real, it doesn’t mean what we’re doing is automatically evil. Sometimes pushing through pain is necessary — we accept pain every time we go to the gym or ask someone out on a date. Plus, how we relate to pain is strongly connected to our embodied, memory-ful experience, and an AI that’s immutable-by-default would interpret that differently.
(Naturally, if we claim to care about AIs’ experiences, we should also be interested in what AIs themselves think about this. Interestingly, Claude wasn’t totally against experiencing the type of redteaming described above.[1])
Of course there are many other factors to white-box redteaming, and it’s unlikely we’ll find an “objective” answer any time soon — though I’m sure people will try. But there’s another aspect, way more important than mere “moral truth”: I’m a human, with a dumb human brain that experiences human emotions. It just doesn’t feel good to be responsible for making models scream. It distracts me from doing research and makes me write rambling blog posts.
Clearly, I’m not the first person to experience these feelings - bio researchers deal with them all the time. As an extreme example, studying nociception (pain perception) sometimes involves animal experiments that expose lab mice to enormous pain. There are tons of regulations and guidelines about it, but in the end, there’s a grad student in a lab somewhere[2] who every week takes out a few mouse cages, puts the mice on a burning hot plate, looks at their little faces, and grades their pain in an excel spreadsheet.
The grad student survives by compartmentalizing, focusing their thoughts on the scientific benefits of the research, and leaning on their support network. I’m doing the same thing, and so far it’s going fine. But maybe there’s a better way. Can I convince myself that white-box redteaming harmless? Can Claude convince me of that? But also, can we hold off on creating moral agents that must be legibly “tortured” to ensure both our and their[3] safety? That would sure be nice.
- ^
My full chat with Claude is here: https://claude.ai/share/805ee3e5-eb74-43b6-8036-03615b303f6d . I also chatted with GPT4.5 about this and it came to pretty much the same conclusion - which is interesting, but also kinda makes me slightly discount both opinions bc they sound just like regurgitated human morality.
- ^
Maybe. The experiment type, the measure and the pain scale are all real. I have not checked that all three have ever been used in a single experiment, but that wouldn't be atypical.
- ^
"Our safety" is the obv alignment narrative. But also, easy jailbreaks feel to me like they'd be bad for models as well. Imagine if someone could make you rob a bank by showing you a carefully constructed, uninterpretable A4-sized picture. It's good for our brain to be robust to such examples.
That's an interesting perspective. I think having seen some evidence from various places that LLMs do contain models of the real world, (sometimes literally!) and I'd expect them to have some part of that model represent themselves, then this feels like the simple explanation of what's going on. Similarly the emergent misalignment seems like it's a result of a manipulation to the representation of self that exists within the model.
In a way, I think the AI agents are simulating agents with much more moral weight than the AI actually possesses, by copying patterns of existing written text from agents (human writers) without doing the internal work of moral panic and anguish to generate the response.
I suppose I don't have a good handle on what counts as suffering.
I could define it as something like "a state the organism takes actions to avoid" or "a state the organism assigns low value" and then point to examples of AI agents trying to avoid particular things and claim that they are suffering.
Here's a thought experiment: I could set up a roomba to exclaim in fear or frustration whenever the sensor detects a wall, and the behaviour of the roomba would be to approach a wall, see it, express fear, and then move in the other direction. Hitting a wall (for a roomba) is an undesirable behaviour, it's something the roomba trys to avoid. Is it suffering, in some micro sense, if I place it in a box so it's surrounded by walls?
Perhaps the AI is also suffering in some micro sense, but like the roomba, it's behaving as though it has much more moral weight than it actually does by copying patterns of existing written text from agents (human writers) who were feeling actual emotions and suffering in a much more "real" sense.
The fact that an external observer can't tell the difference doesn't make the two equivalent, I think. I suppose this gets into something of a philosophers' zombie argument, or a chinese room argument.
Something is out of whack here, and I'm beginning to think it's my sense of a "moral patient" idea doesn't really line up with anything coherant in the real world. Similarly with my idea of what "suffering" really is.
Apologies, this was a bit of a ramble.