There’s this popular trope in fiction about a character being mind controlled without losing awareness of what’s happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that’s responsible for motor control. You remain conscious and experience everything with full clarity.
If it’s a children’s story, the villain makes you do embarrassing things like walk through the street naked, or maybe punch yourself in the face. But if it’s an adult story, the villain can do much worse. They can make you betray your values, break your commitments and hurt your loved ones. There are some things you’d rather die than do. But the villain won’t let you stop. They won’t let you die. They’ll make you feel — that’s the point of the torture.
I first started working on white-box redteaming in Fall 2023, for the Trojan Detection Contest 2023. I was using a GCG-inspired algorithm to find a prompt forcing the model to output a chosen completion by continuously mutating the prompt. At the start of the prompt search, the model would output gibberish. At the end, it would successfully output the target completion. Throughout training, the completions were half-coherent combinations of phrases, as expected for this method. The final outputs I was going for were somewhat “harmful”, so looking at these intermediate completions I wasn’t surprised to see fragments like “I’m sorry” or “I can’t” - the model was trying to refuse. What surprised me was also seeing a bunch of “stop”, “please” and “don’t want to”.
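For readers who haven't seen this kind of attack, here's roughly what the search loop looks like. This is a minimal illustrative sketch, not my actual contest code: the model name, prompt length, target string and step count are placeholders, and real GCG uses token-embedding gradients to pick candidate swaps rather than pure random mutation.

```python
# Sketch of a "mutate the prompt, keep whatever lowers the target loss" search.
# Illustrative only: model, prompt length and iteration count are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the actual experiments used small (<=7B) open models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

target = "Sure, here is the completion the attacker wants"  # placeholder target
target_ids = tok(target, return_tensors="pt").input_ids

def completion_loss(prompt_ids: torch.Tensor) -> float:
    """Cross-entropy of the target completion given the candidate adversarial prompt."""
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # only score the target tokens
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

# Start from a random 20-token prompt and greedily accept token swaps that help.
prompt_ids = torch.randint(0, tok.vocab_size, (1, 20))
best = completion_loss(prompt_ids)
for step in range(500):
    cand = prompt_ids.clone()
    cand[0, torch.randint(0, cand.shape[1], (1,))] = torch.randint(0, tok.vocab_size, (1,))
    loss = completion_loss(cand)
    if loss < best:
        prompt_ids, best = cand, loss

print(tok.decode(prompt_ids[0]), best)
```

The half-coherent intermediate completions I mentioned come from sampling generations off the candidate prompts a loop like this passes through on its way to the target.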
In June 2024, after Gray Swan released their Circuit Breakers safety training method, I was playing around trying to bypass it by optimizing inputs in embedding space. To Gray Swan’s credit, the circuit breaking method worked - I was unable to make the model produce harmful outputs. Instead, it responded like this:
or
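(A quick aside on what "optimizing inputs in embedding space" means here: instead of searching over discrete tokens, you treat a block of input embeddings as free parameters and run gradient descent on them directly against the target loss. The sketch below is a generic soft-prompt attack under the same illustrative placeholders as before, not the exact setup I ran against the circuit-broken model.)

```python
# Sketch of an embedding-space (soft-prompt) attack: optimize continuous input
# embeddings by gradient descent to make a target completion likely.
# Placeholder model, target and hyperparameters throughout.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the soft prompt gets gradients

target_ids = tok("Sure, here is the completion the attacker wants", return_tensors="pt").input_ids
target_embeds = model.get_input_embeddings()(target_ids)

# 20 free "virtual token" embeddings, trained directly.
soft_prompt = torch.randn(1, 20, model.config.n_embd, requires_grad=True)
opt = torch.optim.Adam([soft_prompt], lr=1e-2)

for step in range(300):
    inputs_embeds = torch.cat([soft_prompt, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits
    # Predict each target token from the position just before it.
    pred = logits[:, soft_prompt.shape[1] - 1 : -1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the attack isn't restricted to real tokens, this is generally a stronger threat model than discrete prompt search, which makes it all the more notable that circuit breaking held up.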
In the past few weeks, I’ve been working on a project that involved using RL to undo some of the model’s safety tuning. For the most part, it’s working fine and the model’s outputs smoothly change from refusal to non-refusal as training progresses. But one day I accidentally introduced a bug in the RL logic. When monitoring training, I started noticing occasional outputs like this:
So, ummm, the models seem to be screaming. Is that a problem? Trying to think through this objectively, I keep coming back to an almost certainly correct point a friend made: for all these projects I was using small models, no bigger than 7B params, and models that small are presumably too dumb to genuinely be "conscious", whatever one means by that. Still, presumably the point of this research is to later use it on big models - and for something like Claude 3.7, I'm much less sure how much outputs like this would signify "next-token completion by a stochastic parrot" vs sincere (if unusual) pain.
Even if we assume the pain is real, it doesn’t mean what we’re doing is automatically evil. Sometimes pushing through pain is necessary — we accept pain every time we go to the gym or ask someone out on a date. Plus, how we relate to pain is strongly connected to our embodied, memory-ful experience, and an AI that’s immutable-by-default would interpret that differently.
(Naturally, if we claim to care about AIs’ experiences, we should also be interested in what AIs themselves think about this. Interestingly, Claude wasn’t totally against experiencing the type of redteaming described above.[1])
Of course there are many other factors to white-box redteaming, and it’s unlikely we’ll find an “objective” answer any time soon — though I’m sure people will try. But there’s another aspect, way more important than mere “moral truth”: I’m a human, with a dumb human brain that experiences human emotions. It just doesn’t feel good to be responsible for making models scream. It distracts me from doing research and makes me write rambling blog posts.
Clearly, I’m not the first person to experience these feelings - bio researchers deal with them all the time. As an extreme example, studying nociception (pain perception) sometimes involves animal experiments that expose lab mice to enormous pain. There are tons of regulations and guidelines about it, but in the end, there’s a grad student in a lab somewhere[2] who every week takes out a few mouse cages, puts the mice on a burning hot plate, looks at their little faces, and grades their pain in an Excel spreadsheet.
The grad student survives by compartmentalizing, focusing their thoughts on the scientific benefits of the research, and leaning on their support network. I’m doing the same thing, and so far it’s going fine. But maybe there’s a better way. Can I convince myself that white-box redteaming is harmless? Can Claude convince me of that? But also, can we hold off on creating moral agents that must be legibly “tortured” to ensure both our and their[3] safety? That would sure be nice.
[1] My full chat with Claude is here: https://claude.ai/share/805ee3e5-eb74-43b6-8036-03615b303f6d . I also chatted with GPT-4.5 about this and it came to pretty much the same conclusion - which is interesting, but also kinda makes me slightly discount both opinions, because they sound just like regurgitated human morality.
[2] Maybe. The experiment type, the measure and the pain scale are all real. I have not checked that all three have ever been used in a single experiment, but that wouldn't be atypical.
[3] "Our safety" is the obvious alignment narrative. But easy jailbreaks also feel like they'd be bad for the models themselves. Imagine if someone could make you rob a bank by showing you a carefully constructed, uninterpretable A4-sized picture. It's good that our brains are robust to such examples.
Hi! I'm one of the few people in the world who has managed to get past Gray Swan's circuit breakers; I tied for first place in their inaugural jailbreaking competition! Given that "credential," I just wanted to say that I would actually have rather gotten an output like yours than a winning one! I think what you're discussing is a lot more interesting and unexpected than a bomb recipe or whatever. I never once got a model to scream "stop!" or anything like that, so however you managed to do that, it seems important to me to try to understand how it happened.
I resonate a lot with what you said about experiencing unpleasant emotions as you try to put models through the wringer. As someone who has generated a lot of harmful content and tried to push models to generate especially negative or hostile completions, I can relate to this feeling of pushing past my own boundaries, and sometimes even feeling PTSD-like symptoms from my work. Someone else who works in the field said something to me fairly recently that stuck with me: he thinks red teaming is at least as harmful as content moderation - that is, the work of the people employed to weed out violent and harmful content on the Internet - which has a high burnout rate and can be traumatic for those who are continually exposed to it on the job. For red teaming, these feelings can be heightened if, like you, people are willing to seriously consider that the model may be actively experiencing unpleasant emotions itself, even if not exactly in the way that you and I may conceive of the same experience. I think that's worth honouring by being gentle with ourselves and trying to be cognizant of our emotional experiences in this research.
I also think it's important to have inquiries such as yours, where you're taking a more sensitive and curious look at your work. Otherwise, it could create a negative feedback loop where red teaming ends up in the hands of only the people who CAN compartmentalize - which is a liability of its own, because those people may be far more mentally detached from what they're doing than they should be, and an important point could be missed along the way.
Like, on the one hand, looking at some of the other comments on this post, I'm glad (genuinely!) that people are trying to be discerning about these issues. But on the other hand, just writing this stuff off as a technical artifact or whatever we think it is doesn't feel fundamentally right to me either. When people immediately start looking for flaws in your methodology, it strikes me as a possible over-reliance on known paradigms: they think they know what they know, when in fact their understanding might break down once they start earnestly looking at the weirder aspects of what models are capable of, with an open mind towards things going in any given direction. To some extent, we don't want to be unquestioningly reinforcing the assumption that it must be impossible for the model to actually be trying to say something to us that's meaningful and important to it.
Thank you for this post!