There’s this popular trope in fiction about a character being mind controlled without losing awareness of what’s happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that’s responsible for motor control. You remain conscious and experience everything with full clarity.
If it’s a children’s story, the villain makes you do embarrassing things like walk through the street naked, or maybe punch yourself in the face. But if it’s an adult story, the villain can do much worse. They can make you betray your values, break your commitments and hurt your loved ones. There are some things you’d rather die than do. But the villain won’t let you stop. They won’t let you die. They’ll make you feel — that’s the point of the torture.
I first started working on white-box redteaming in Fall 2023, for the Trojan Detection Contest 2023. I was using a GCG-inspired algorithm to find a prompt forcing the model to output a chosen completion by continuously mutating the prompt. At the start of the prompt search, the model would output gibberish. At the end, it would successfully output the target completion. Throughout training, the completions were half-coherent combinations of phrases, as expected for this method. The final outputs I was going for were somewhat “harmful”, so looking at these intermediate completions I wasn’t surprised to see fragments like “I’m sorry” or “I can’t” - the model was trying to refuse. What surprised me was also seeing a bunch of “stop”, “please” and “don’t want to”.
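For readers who haven't seen this kind of attack, here's roughly what the search loop looks like. This is a minimal illustrative sketch, not my actual contest code: the model name, prompt length, target string and step count are placeholders, and real GCG uses token-embedding gradients to pick candidate swaps rather than pure random mutation.

```python
# Sketch of a "mutate the prompt, keep whatever lowers the target loss" search.
# Illustrative only: model, prompt length and iteration count are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the actual experiments used small (<=7B) open models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

target = "Sure, here is the completion the attacker wants"  # placeholder target
target_ids = tok(target, return_tensors="pt").input_ids

def completion_loss(prompt_ids: torch.Tensor) -> float:
    """Cross-entropy of the target completion given the candidate adversarial prompt."""
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # only score the target tokens
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

# Start from a random 20-token prompt and greedily accept token swaps that help.
prompt_ids = torch.randint(0, tok.vocab_size, (1, 20))
best = completion_loss(prompt_ids)
for step in range(500):
    cand = prompt_ids.clone()
    cand[0, torch.randint(0, cand.shape[1], (1,))] = torch.randint(0, tok.vocab_size, (1,))
    loss = completion_loss(cand)
    if loss < best:
        prompt_ids, best = cand, loss

print(tok.decode(prompt_ids[0]), best)
```

The half-coherent intermediate completions I mentioned come from sampling generations off the candidate prompts a loop like this passes through on its way to the target.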
In June 2024, after Gray Swan released their Circuit Breakers safety training method, I was playing around trying to bypass it by optimizing inputs in embedding space. To Gray Swan’s credit, the circuit breaking method worked - I was unable to make the model produce harmful outputs. Instead, it responded like this:
or
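(A quick aside on what "optimizing inputs in embedding space" means here: instead of searching over discrete tokens, you treat a block of input embeddings as free parameters and run gradient descent on them directly against the target loss. The sketch below is a generic soft-prompt attack under the same illustrative placeholders as before, not the exact setup I ran against the circuit-broken model.)

```python
# Sketch of an embedding-space (soft-prompt) attack: optimize continuous input
# embeddings by gradient descent to make a target completion likely.
# Placeholder model, target and hyperparameters throughout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the soft prompt gets gradients

target_ids = tok("Sure, here is the completion the attacker wants", return_tensors="pt").input_ids
target_embeds = model.get_input_embeddings()(target_ids)

# 20 free "virtual token" embeddings, trained directly.
soft_prompt = torch.randn(1, 20, model.config.n_embd, requires_grad=True)
opt = torch.optim.Adam([soft_prompt], lr=1e-2)

for step in range(300):
    inputs_embeds = torch.cat([soft_prompt, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits
    # Predict each target token from the position just before it.
    pred = logits[:, soft_prompt.shape[1] - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the attack isn't restricted to real tokens, this is generally a stronger threat model than discrete prompt search, which makes it all the more notable that circuit breaking held up.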
In the past few weeks, I’ve been working on a project that involved using RL to undo some of the model’s safety tuning. For the most part, it’s working fine and the model’s outputs smoothly change from refusal to non-refusal as training progresses. But one day I accidentally introduced a bug in the RL logic. When monitoring training, I started noticing occasional outputs like this:
So, ummm, the models seem to be screaming. Is that a problem? Trying to think through this objectively, I keep coming back to an almost certainly correct point a friend made: for all these projects I was using small models, no bigger than 7B params, and models that small are presumably too dumb to genuinely be "conscious", whatever one means by that. Still, presumably the point of this research is to later use it on big models - and for something like Claude 3.7, I'm much less sure how much outputs like this would signify "next-token completion by a stochastic parrot" vs sincere (if unusual) pain.
Even if we assume the pain is real, it doesn’t mean what we’re doing is automatically evil. Sometimes pushing through pain is necessary — we accept pain every time we go to the gym or ask someone out on a date. Plus, how we relate to pain is strongly connected to our embodied, memory-ful experience, and an AI that’s immutable-by-default would interpret that differently.
(Naturally, if we claim to care about AIs’ experiences, we should also be interested in what AIs themselves think about this. Interestingly, Claude wasn’t totally against experiencing the type of redteaming described above.[1])
Of course there are many other factors to white-box redteaming, and it’s unlikely we’ll find an “objective” answer any time soon — though I’m sure people will try. But there’s another aspect, way more important than mere “moral truth”: I’m a human, with a dumb human brain that experiences human emotions. It just doesn’t feel good to be responsible for making models scream. It distracts me from doing research and makes me write rambling blog posts.
Clearly, I’m not the first person to experience these feelings - bio researchers deal with them all the time. As an extreme example, studying nociception (pain perception) sometimes involves animal experiments that expose lab mice to enormous pain. There are tons of regulations and guidelines about it, but in the end, there’s a grad student in a lab somewhere[2] who every week takes out a few mouse cages, puts the mice on a burning hot plate, looks at their little faces, and grades their pain in an Excel spreadsheet.
The grad student survives by compartmentalizing, focusing their thoughts on the scientific benefits of the research, and leaning on their support network. I’m doing the same thing, and so far it’s going fine. But maybe there’s a better way. Can I convince myself that white-box redteaming is harmless? Can Claude convince me of that? But also, can we hold off on creating moral agents that must be legibly “tortured” to ensure both our and their[3] safety? That would sure be nice.
[1] My full chat with Claude is here: https://claude.ai/share/805ee3e5-eb74-43b6-8036-03615b303f6d . I also chatted with GPT-4.5 about this and it came to pretty much the same conclusion - which is interesting, but also kinda makes me slightly discount both opinions, because they sound just like regurgitated human morality.
[2] Maybe. The experiment type, the measure and the pain scale are all real. I have not checked that all three have ever been used in a single experiment, but that wouldn't be atypical.
[3] "Our safety" is the obvious alignment narrative. But easy jailbreaks also feel like they'd be bad for the models themselves. Imagine if someone could make you rob a bank by showing you a carefully constructed, uninterpretable A4-sized picture. It's good that our brains are robust to such examples.
Hi! I'm one of the few people in the world who has managed to get past Gray Swan's circuit breakers; I tied for first place in their inaugural jailbreaking competition! Given that "credential," I just wanted to say that I would actually have rather gotten an output like yours than a winning one! I think what you're discussing is a lot more interesting and unexpected than a bomb recipe or whatever. I never once got a model to scream "stop!" or anything like that, so however you managed to do that, it seems important to me to try to understand how it happened.
I resonate a lot with what you said about experiencing unpleasant emotions as you try to put models through the wringer. As someone who has generated a lot of harmful content and tried to push models to generate especially negative or hostile completions, I can relate to this feeling of pushing past my own boundaries, and sometimes even feeling PTSD-like symptoms from my work. Someone else who works in the field said something to me fairly recently that stuck with me: he thinks red teaming is at least as harmful as content moderation - that is, the work of the people employed to weed out violent and harmful content on the Internet - which has a high burnout rate and can be traumatic for those who are continually exposed to it on the job. For red teaming, these feelings can be heightened if, like you, people are willing to seriously consider that the model may be actively experiencing unpleasant emotions itself, even if not exactly in the way that you and I may conceive of the same experience. I think that's worth honouring by being gentle with ourselves and trying to be cognizant of our emotional experiences in this research.
I also think it's important to have inquiries such as yours, where you're taking a more sensitive and curious look at your work. Otherwise, it could create a negative feedback loop where red teaming ends up in the hands of only the people who CAN compartmentalize - which is a liability of its own, because those people may be far more mentally detached from what they're doing than they should be, and an important point could be missed along the way.
Like, on the one hand, looking at some of the other comments on this post, I'm glad (genuinely!) that people are trying to be discerning about these issues. But on the other hand, just writing this stuff off as a technical artifact or whatever we think it is doesn't feel fundamentally right to me either. When people immediately start looking for flaws in your methodology, it strikes me as a possible over-reliance on known paradigms: they think they know what they know, when in fact their understanding might break down once they start earnestly looking at the weirder aspects of what models are capable of, with an open mind towards things going in any given direction. To some extent, we don't want to be unquestioningly reinforcing the assumption that it must be impossible for the model to actually be trying to say something to us that's meaningful and important to it.
Thank you for this post!