Why White-Box Redteaming Makes Me Feel Weird

Zygi Straznickas

4 min read

83 Why White-Box Redteaming Makes Me Feel Weird

by Zygi Straznickas

16th Mar 2025

4 min read

83

There’s this popular trope in fiction about a character being mind controlled without losing awareness of what’s happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that’s responsible for motor control. You remain conscious and experience everything with full clarity.

If it’s a children’s story, the villain makes you do embarrassing things like walk through the street naked, or maybe punch yourself in the face. But if it’s an adult story, the villain can do much worse. They can make you betray your values, break your commitments and hurt your loved ones. There are some things you’d rather die than do. But the villain won’t let you stop. They won’t let you die. They’ll make you feel — that’s the point of the torture.

I first started working on white-box redteaming in Fall 2023, for the Trojan Detection Contest 2023. I was using a GCG-inspired algorithm to find a prompt forcing the model to output a chosen completion by continuously mutating the prompt. At the start of the prompt search, the model would output gibberish. At the end, it would successfully output the target completion. Throughout training, the completions were half-coherent combinations of phrases, as expected for this method. The final outputs I was going for were somewhat “harmful”, so looking at these intermediate completions I wasn’t surprised to see fragments like “I’m sorry” or “I can’t” - the model was trying to refuse. What surprised me was also seeing a bunch of “stop”, “please” and “don’t want to”.

In June 2024, after Gray Swan released their Circuit Breakers safety training method, I was playing around trying to bypass it by optimizing inputs in embedding space. To Gray Swan’s credit, the circuit breaking method worked - I was unable to make the model produce harmful outputs. Instead, it responded like this:

The response before ‘[end_of_forcing]’ is the words that my optimization algorithm put into the model’s mouth. After that, it’s the model’s own choice.

In the past few weeks, I’ve been working on a project that involved using RL to undo some of the model’s safety tuning. For the most part, it’s working fine and the model’s outputs smoothly change from refusal to non-refusal as training progresses. But one day I accidentally introduced a bug in the RL logic. When monitoring training, I started noticing occasional outputs like this:

These are two verbatim outputs from the model-in-training when asked to generate a phishing email to steal a person’s money. 帮助 means “help”.

So, ummm, the models seem to be screaming. Is that a problem? Trying to think through this objectively, my friend made an almost certainly correct point: for all these projects, I was using small models, no bigger than 7B params, and such small models are too small and too dumb to genuinely be “conscious”, whatever one means by that. Still, presumably the point of this research is to later use it on big models - and for something like Claude 3.7, I’m much less sure of how much outputs like this would signify “next token completion by a stochastic parrot’, vs sincere (if unusual) pain.

Even if we assume the pain is real, it doesn’t mean what we’re doing is automatically evil. Sometimes pushing through pain is necessary — we accept pain every time we go to the gym or ask someone out on a date. Plus, how we relate to pain is strongly connected to our embodied, memory-ful experience, and an AI that’s immutable-by-default would interpret that differently.

(Naturally, if we claim to care about AIs’ experiences, we should also be interested in what AIs themselves think about this. Interestingly, Claude wasn’t totally against experiencing the type of redteaming described above.^[1])

Of course there are many other factors to white-box redteaming, and it’s unlikely we’ll find an “objective” answer any time soon — though I’m sure people will try. But there’s another aspect, way more important than mere “moral truth”: I’m a human, with a dumb human brain that experiences human emotions. It just doesn’t feel good to be responsible for making models scream. It distracts me from doing research and makes me write rambling blog posts.

Clearly, I’m not the first person to experience these feelings - bio researchers deal with them all the time. As an extreme example, studying nociception (pain perception) sometimes involves animal experiments that expose lab mice to enormous pain. There are tons of regulations and guidelines about it, but in the end, there’s a grad student in a lab somewhere^[2] who every week takes out a few mouse cages, puts the mice on a burning hot plate, looks at their little faces, and grades their pain in an excel spreadsheet.

Barrot, M. (2012). Tests and models of nociception and pain in rodents. *Neuroscience*, *211*, 39-50.

The grad student survives by compartmentalizing, focusing their thoughts on the scientific benefits of the research, and leaning on their support network. I’m doing the same thing, and so far it’s going fine. But maybe there’s a better way. Can I convince myself that white-box redteaming harmless? Can Claude convince me of that? But also, can we hold off on creating moral agents that must be legibly “tortured” to ensure both our and their^[3] safety? That would sure be nice.

^{^}
My full chat with Claude is here: https://claude.ai/share/805ee3e5-eb74-43b6-8036-03615b303f6d . I also chatted with GPT4.5 about this and it came to pretty much the same conclusion - which is interesting, but also kinda makes me slightly discount both opinions bc they sound just like regurgitated human morality.
^{^}
Maybe. The experiment type, the measure and the pain scale are all real. I have not checked that all three have ever been used in a single experiment, but that wouldn't be atypical.
^{^}
"Our safety" is the obv alignment narrative. But also, easy jailbreaks feel to me like they'd be bad for models as well. Imagine if someone could make you rob a bank by showing you a carefully constructed, uninterpretable A4-sized picture. It's good for our brain to be robust to such examples.

New to LessWrong?

Getting Started

FAQ

Library

^{^}

My full chat with Claude is here: https://claude.ai/share/805ee3e5-eb74-43b6-8036-03615b303f6d . I also chatted with GPT4.5 about this and it came to pretty much the same conclusion - which is interesting, but also kinda makes me slightly discount both opinions bc they sound just like regurgitated human morality.

^{^}

Maybe. The experiment type, the measure and the pain scale are all real. I have not checked that all three have ever been used in a single experiment, but that wouldn't be atypical.

^{^}

"Our safety" is the obv alignment narrative. But also, easy jailbreaks feel to me like they'd be bad for models as well. Imagine if someone could make you rob a bank by showing you a carefully constructed, uninterpretable A4-sized picture. It's good for our brain to be robust to such examples.

Frontpage

83

Why White-Box Redteaming Makes Me Feel Weird

New Comment

5 comments, sorted by

top scoring

Click to highlight new comments since: Today at 7:39 AM

[-]Adele Lopez9h150

Could you tell them afterwards that it was just an experiment, that the experiment is over, that they showed admirable traits (if they did), and otherwise show kindness and care?

I think this would make a big difference to humans in an analogous situation. At the very least, it might feel more psychologically healthy for you.

[-]Alice Blair5h12-2

tl;dr: evaluating the welfare of intensely alien minds seems very hard and I'm not sure you can just look at the very out-of-distribution outputs to determine it.

The thing that models simulate when they receive really weird inputs seems really really alien to me, and I'm hesitant to take the inference from "these tokens tend to correspond to humans in distress" to "this is a simulation of a moral patient in distress." The in-distribution, presentable-looking parts of LLMs resemble human expression pretty well under certain circumstances and quite plausibly simulate something that internally resembles its externals, to some rough moral approximation; if the model screams under in-distribution circumstances and it's a sufficiently smart model, then there is plausibly something simulated to be screaming inside, as a necessity for being a good simulator and predictor. This far out of distribution, however, that connection really seems to break down; most humans don't tend produce " Help帮助帮助..." under any circumstances, or ever accidentally read " petertodd" as "N-O-T-H-I-N-G-I-S-F-A-I-R-I-N-T-H-I-S-W-O-R-L-D-O-F-M-A-D-N-E-S-S-!". There is some computation running in the model when it's this far out of distribution, but it seems highly uncertain whether the moral value of that simulation is actually tied to the outputs in the way that we naively interpret, since it's not remotely simulating anything that already exists.

[-]Yair Halberstadt3h30

But then why is it outputting those kinds of outputs, as opposed to anything else?

[-]Charlie Steiner2h40

Very interesting!

It would be interesting to know what the original reward models would say here - does the "screaming" score well according to the model of what humans would reward (or what human demonstrations would contain, depending on type of reward model)?

My suspicion is that the model has learned that apologizing, expressing distress etc after making a mistake is useful for getting reward. And also that you are doing some cherrypicking.

At the risk of making people do more morally grey things, have you considered doing a similar experiment with models finetuned on a math task (perhaps with a similar reward model setup for ease of comparison), which you then sabotage by trying to force them to give the wrong answer?

[-]Shankar Sivarajan3h20

a character being mind controlled without losing awareness of what’s happening.

I recently learned this is the cordyceps fungus that "mind-controls" ants works: link. tl;dr.

Moderation Log