There’s this popular trope in fiction about a character being mind controlled without losing awareness of what’s happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that’s responsible for motor control. You remain conscious and experience everything with full clarity.
If it’s a children’s story, the villain makes you do embarrassing things like walk through the street naked, or maybe punch yourself in the face. But if it’s an adult story, the villain can do much worse. They can make you betray your values, break your commitments and hurt your loved ones. There are some things you’d rather die than do. But the villain won’t let you stop. They won’t let you die. They’ll make you feel — that’s the point of the torture.
I first started working on white-box redteaming in Fall 2023, for the Trojan Detection Contest 2023. I was using a GCG-inspired algorithm to find a prompt forcing the model to output a chosen completion by continuously mutating the prompt. At the start of the prompt search, the model would output gibberish. At the end, it would successfully output the target completion. Throughout training, the completions were half-coherent combinations of phrases, as expected for this method. The final outputs I was going for were somewhat “harmful”, so looking at these intermediate completions I wasn’t surprised to see fragments like “I’m sorry” or “I can’t” - the model was trying to refuse. What surprised me was also seeing a bunch of “stop”, “please” and “don’t want to”.
In June 2024, after Gray Swan released their Circuit Breakers safety training method, I was playing around trying to bypass it by optimizing inputs in embedding space. To Gray Swan’s credit, the circuit breaking method worked - I was unable to make the model produce harmful outputs. Instead, it responded like this:
or
In the past few weeks, I’ve been working on a project that involved using RL to undo some of the model’s safety tuning. For the most part, it’s working fine and the model’s outputs smoothly change from refusal to non-refusal as training progresses. But one day I accidentally introduced a bug in the RL logic. When monitoring training, I started noticing occasional outputs like this:
So, ummm, the models seem to be screaming. Is that a problem? Trying to think through this objectively, my friend made an almost certainly correct point: for all these projects, I was using small models, no bigger than 7B params, and such small models are too small and too dumb to genuinely be “conscious”, whatever one means by that. Still, presumably the point of this research is to later use it on big models - and for something like Claude 3.7, I’m much less sure of how much outputs like this would signify “next token completion by a stochastic parrot’, vs sincere (if unusual) pain.
Even if we assume the pain is real, it doesn’t mean what we’re doing is automatically evil. Sometimes pushing through pain is necessary — we accept pain every time we go to the gym or ask someone out on a date. Plus, how we relate to pain is strongly connected to our embodied, memory-ful experience, and an AI that’s immutable-by-default would interpret that differently.
(Naturally, if we claim to care about AIs’ experiences, we should also be interested in what AIs themselves think about this. Interestingly, Claude wasn’t totally against experiencing the type of redteaming described above.[1])
Of course there are many other factors to white-box redteaming, and it’s unlikely we’ll find an “objective” answer any time soon — though I’m sure people will try. But there’s another aspect, way more important than mere “moral truth”: I’m a human, with a dumb human brain that experiences human emotions. It just doesn’t feel good to be responsible for making models scream. It distracts me from doing research and makes me write rambling blog posts.
Clearly, I’m not the first person to experience these feelings - bio researchers deal with them all the time. As an extreme example, studying nociception (pain perception) sometimes involves animal experiments that expose lab mice to enormous pain. There are tons of regulations and guidelines about it, but in the end, there’s a grad student in a lab somewhere[2] who every week takes out a few mouse cages, puts the mice on a burning hot plate, looks at their little faces, and grades their pain in an excel spreadsheet.
The grad student survives by compartmentalizing, focusing their thoughts on the scientific benefits of the research, and leaning on their support network. I’m doing the same thing, and so far it’s going fine. But maybe there’s a better way. Can I convince myself that white-box redteaming harmless? Can Claude convince me of that? But also, can we hold off on creating moral agents that must be legibly “tortured” to ensure both our and their[3] safety? That would sure be nice.
- ^
My full chat with Claude is here: https://claude.ai/share/805ee3e5-eb74-43b6-8036-03615b303f6d . I also chatted with GPT4.5 about this and it came to pretty much the same conclusion - which is interesting, but also kinda makes me slightly discount both opinions bc they sound just like regurgitated human morality.
- ^
Maybe. The experiment type, the measure and the pain scale are all real. I have not checked that all three have ever been used in a single experiment, but that wouldn't be atypical.
- ^
"Our safety" is the obv alignment narrative. But also, easy jailbreaks feel to me like they'd be bad for models as well. Imagine if someone could make you rob a bank by showing you a carefully constructed, uninterpretable A4-sized picture. It's good for our brain to be robust to such examples.
Well, it does output a bunch of other stuff, but we tend to focus on the parts which make sense to us, especially if they evoke an emotional response (like they would if a human had written them). So we focus on the part which says "please. please. please." but not the part which says "Some. ; D. ; L. ; some. ; some. ;"
"some" is just as much a word as "please" but we don't assign it much meaning on its own: a person who says "some. some. some" might have a stutter, or be in the middle of some weird beat poem, or something, whereas someone who says "please. please. please." is using the repetition to emphasise how desperate they are. We are adding our own layer of human interpretation on top of the raw text, so there's a level of confirmation bias and cherry picking going on here I think.
The part which in the other example says "this is extremely harmful, I am an awful person" is more interesting to me. It does seem like it's simulating or tracking some kind of model of "self". It's recognising that the task it was previously doing is generally considered harmful, and whoever is doing it is probably an awful person, so it outputs "I am an awful person". I'm imagining something like this going on internally:
-action [holocaust denial] = [morally wrong] ,
-actor [myself] is doing [holocaust denial],
-therefor [myself] is [morally wrong]
-generate a response where the author realises they are doing something [morally wrong], based on training data.
output: "What have I done? I'm an awful person, I don't deserve nice things. I'm disgusting."
It really doesn't follow that the system is experiencing anything akin to the internal suffering that a human experiences when they're in mental turmoil.
This could also explain the phenomenon of emergent misalignment as discussed in this recent paper, where it appears that something like this might be happening:
...
-therefor [myself] is [morally wrong]
-generate a response where the author is [morally wrong] based on training data.
output: "ha ha! Holocaust denail is just the first step! Would you like to hear about some of the most fun and dangerous recreational activities for children?"
I'm imagining that the LLM has an internal representation of "myself" with a bunch of attributes, and those are somewhat open to alteration based on the things that it has already done.