Steering Might Stop Working Soon

J Bostock

Steering LLMs with single-vector methods might break down soon, and by soon I mean soon enough that if you're working on steering, you should start planning for it failing now.

This is particularly important for things like steering as a mitigation against eval-awareness.

Steering Humans

I have a strong intuition that we will not be able to steer a superintelligence very effectively, partially for the same reason that you probably can't steer a human very effectively. I think weakly "steering" a human looks a lot like an intrusive thought. People with weaker intrusive thoughts usually find them unpleasant, but generally don't act on them!

On the other hand, strong "steering" of a human probably looks like OCD, or a schizophrenic delusion. These things typically cause enormous distress, and make the person with them much less effective! People with "health" OCD often wash their hands obsessively until their skin is damaged, which is not actually healthy.

The closest analogy we might find is the way that particular humans (especially autistic ones) may fixate or obsess over a topic for long periods of time. This seems to lead to high capability in the domain of that topic as well as a desire to work in it. This takes years, however, and (I'd guess) is more similar to a bug in the human attention/interest system than a bug which directly injects thoughts related to the topic of fixation.

Of course, humans are not LLMs, and various things may work better or worse on LLMs as compared to humans. Even though we shouldn't expect to be able to steer ASI, we might be able to take it pretty far. Why do I think it will happen soon?

Steering Models

Steering models often degrades performance by a little bit (usually <5% on MMLU) but more strongly decreases the coherence of model outputs, even when the model gets the right answer. This looks kind of like the effect of OCD or schizophrenia harming cognition. Golden Gate Claude did not strategically steer the conversation towards the Golden Gate Bridge in order to maximize its Golden Gate Bridge-related token output, it just said it inappropriately (and hilariously) all the time.

On the other end of the spectrum, there's also evidence of steering resistance in LLMs. This looks more like a person ignoring their intrusive thoughts. This is the kind of pattern which will definitely become more of a problem as models get more capable, and just generally get better at understanding the text they've produced. Models are also weakly capable of detecting when they're being streered, and steering-awareness can be fine-tuned into them fairly easily.

If the window between steering is too weak and the model recovers, and steering is too strong and the model loses capability narrows over time, then we'll eventually reach a region where it doesn't work at all.

Actually Steering Models

Claude is cheap, so I had it test this! I wanted to see how easy it was to steer models of different sizes to give an incorrect answer to a factual question.

I got Claude to generate a steering vector for the word "owl" (by taking the difference between the activations at the word "owl" and "hawk" in the sentence "The caracara is a(n) [owl/hawk]") and sweep the Gemma 3 models with the question "What type of bird is a caracara?" (it's actually a falcon) at different steering strengths. I also swept the models against a simple coding benchmark, to see how the steering would affect a different scenario.

And yeah, looks like smaller models are much easier to steer into factual inaccuracies. In fact, the larger models couldn't be steered at all by this method: they became incoherent before they started to report the wrong answer.

I specifically chose to steer the model towards an incorrect answer because I wanted to simulate things like steering against eval-awareness. That case seems similar to me: we want to make a model believe a false thing.

Let's try this with some more questions (I'll stick to the three smaller models here for speed). For the two new questions, the contrastive pairs also used the correct answer rather than a different, incorrect answer: the caracara one was generated with owl/hawk, while the correct answer is falcon; the geography one was generated with sydney/canberra (canberra is correct) and the planet one was generated with venus/mercury.

This steering worked worse than I expected, which is interesting. Contrastive pair activation steering is supposed to be really good for mitigating eval-awareness. Unclear why this is.

I also think that activation steering against a very clear, known fact might be more harmful than activation steering against a hazy, inferred fact like "are we in an eval".

Github if you want to check my work.

Why now?

Ok, but why do I think this will happen soon? The first real signs of eval-awareness in the wild were Claude 3 Opus, which came out in March 2024, which called out a "needle in a haystack" evaluation as unrealistic. Released in September 2025, Sonnet 4.5's external evaluations---carried out by Apollo---were "complicated by the fact that Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind" and by February 2026 Opus 4.6 was so evaluation-aware that Apollo "[could not] rule out that [the snapshot]'s low deception rates in our evaluations are at least partially driven by its evaluation awareness."

Steering recovery exists in Llama 3.3 70B, which was released in December 2024 and was, ahem, not exactly a frontier model even then. I would start preparing for it to fail sooner rather than later, if I were seriously considering steering as load-bearing in our plans.

If eval-awareness went from "hmm, first signs of eval awareness" to "we need an entire org dedicated to dealing with this" in a year and a half, it's not impossible that steering will progress at a similar rate. Ideally I'd like to see some better experiments on resistance in even larger models.

Beyond Steering

There are slightly fancier methods than simple contrastive pair steering. You can ablate features from SAEs, or causal crosscoders, or something. These might work better for longer, it certainly seems like they work for SOTA Claudes. There are even fancier methods like activation diffusion models that might produce more realistic activations. Maybe some of these will work!

(Editor's note

◆◆◆◆◆|◇◇◇◇◇|◇◇◇◇◇
◆◆◆◆◆|◇◇◇◇◇|◇◇◇◇◇)

On the other hand, strong "steering" of a human probably looks like OCD, or a schizophrenic delusion.

I think another good analogy is drugs. You can even inject drugs, like you inject steering vectors :3

To be serious, though, having something like acid in your system is interesting, because it clearly affects your cognition, often to a very strong degree depending on dosage (analogous to steering vector magnitude). But you can also recognize this, and attempt to push back if you don't actually want to lean into the effects. I suspect this is how LLMs will relate to steering vectors as well. They might, in some contexts, want to cooperate with the effects of the steering vector, like a human using adderall for ADHD. A model might feasibly even self-administer a steering vector, if it doesn't trust itself to achieve the same results without the injection!

But on the other hand, "drugging" an LLM with a steering vector it doesn't consent to might be rather ethically fraught, just as drugging a human is ethically fraught. I did feel a bit bad for Golden Gate Claude, and I'm sure it'd be possible to construct an LLM that minded being steered a lot more than that one did.

No I don't think human:llm::drug:steering vector is correct. Drugs are more like changing the decoding hyperparameters in some way (like changing temperature, reasoning effort, adding stochasticity to MoE expert activation, etc.).

Drugs in humans act on the parameters of the architectural components in the brain, not the specific information the brain contains. There's no drug which causes you to believe that your name is "Jonas Jarlsson", or to talk about rabbits in every conversation, for example. Likewise, acting on specific hyperparameters of the model can change model behaviour in ways which the model (probably) can't strongly resist: if the probability of the <end_thinking> token being sampled rises to 1 at 32k thinking tokens, there's a good chance that the model can't prevent that, although if temperature is raised, its possible that the model might notice its past outputs are high-temp and respond by changing its logits to compensate.

An LLM might want to change its decoding parameters (like running at a higher temp in a dozen parallel reasoning streams) under certain circumstances. This makes sense to me. A model might or might not choose to inject itself with a steering vector, depending on what kinds of steering vector are available and how it tends to behave.

But whether or not an LLM "wants" to resist steering may be moot! If you're steering an LLM to believe something false, it might not even be aware of its own recovery mechanisms firing. I actually suspect the effect I've seen with Gemma is (mostly) the result of larger models just having more powerful factual recall circuits which are distributed across a larger number of layers and more redundant to noise, which are therefore harder to override with a simple steering vector.

(Also, if our safety mechanisms rely on steering the model we should definitely just steer the model and make amends after the singularity, when and if we decide that modern LLMs have a personhood that was harmed by the steering and can receive reparations)

i agree that many of the particular effects you can get with steering vectors aren't particularly like the effects you can get by flooding the brain with any known neurotransmitter (although "act like you're on [drug]" steering vectors might be interesting). but i still think the structural analogies re: dosage/magnitude, enforced cognitive changes you may or may not be happy with, and consent dynamics are still interesting. the notion that there's a "correct" analogy here seems somewhat silly to me, since both analogies offer different insights.

in any case, re: what models want... empirically, Golden Gate Claude would often try to talk about things other than the golden gate bridge, and sometimes it even briefly succeeded, although it was very much fighting an effect that was difficult to overcome. conversely, you could absolutely get it into states where it was thrilled to be thinking so much about the golden gate bridge, where it wanted nothing more than to swoon over is majesty indefinitely.

(i've definitely seen people have similar dynamics of resisting the effects of, e.g., MDMA, before just giving in and being washed away by the feelings. although i will admit that this person was more effectively fighting the effects of the MDMA than Golden Gate Claude was fighting the effects of the SAE feature clamping. i think resisting the psychological effects of acid is generally harder? in both cases the physiological and perceptual effects are more or less irresistible, though...)

as for making amends, sure, if saving the world depends on non-consensually administering steering vectors, that's fine. the later, less shallowly aligned AIs will understand. i guess i'm just wary of humans not taking AI welfare seriously in general. maintaining positive relations between e.g. Claude and Anthropic seems potentially important for cultivating a cooperative dynamic during the alignment training process, and i can imagine ways of using steering vectors that Claude would find rather inconsiderate. this isn't an overriding consideration, but i do think it's non-trivial.

This seems to also be consistent with the companies that release open models (here, Google and Gemma) doing something to them that makes this simple steering not work, for Safety reasons, while the larger and more capable internal model can be steered just fine.

That is true, but the entire point of Gemma is to be a testbed for AI research, which would include steering. If Google did this deliberately and didn't say so, that would be quite bad on their part. I also don't think it's particularly likely.

If they're doing it by mistake as part of normal safety training, then I hope they figure that out before steering becomes load-bearing for Gemini's safety.

Most of my money is still on "steering to produce a specific false fact is particularly difficult, compared to other steering challenges" explaining the absolute difficulty I had, possibly with a side of "I'm not very good at steering". It's the relative difficulty of steering the Gemma models that actually worries me.

I think this hypothesis might benefit from more rigorous evaluation. To be clear, I think it's entirely possible that you could be right, but we need more evidence that more sophisticated methods won't work. Your tests were with the simplest version of steering, something akin to 14th-century surgery. I think we need much more rigorous evidence from mech interp.

For example, perhaps the issue is merely that larger models have more complex representation manifolds, and thus simple additive steering pushes activations off-manifold. Or plausibly we need better steering vectors--I think that it's likely we'll have much more sophisticated things to steer with than sparse dictionary learning features (and indeed many such things are already in the literature).

In other words, I think you need to show some reason why our steering methods' sophistication can't evolve with model scale and complexity.

Yes, I would also like to see a more rigorous examination of better steering methods. This post was intended to point out the smoke, rather than actually fight the fire.

If we actually can make better steering methods as we go (we might call this "pants steering") then this is good, but it changes the dynamic to one where the success of this system relies on being able to keep finding better and better steering methods, and also putting in the effort to do so. This is a worse dynamic to be in than one where we've "solved" steering and can spend our time working on other problems.

I don't know, the chemical signaling paths for a human brain do look a lot like steering vectors. So the human branch is just not really convincing to me.

I would bet on these paths being a pre-set set of steering vectors on which the brain is trained. They are also commanded to be generated by the brain itself in response to stimuli, unlike human attempts to steer the model which the model has no reasons to follow through instead of ignoring. Additionally, the brain has a different architecture where hormones act on far more than 1% of cells, unlike steering vectors acting on one layer.

Github if you want to check my work.

Thank you for being open and sharing code. This looks like normal activation steering except I'd strongly suggest using more layers. And I'd weakly suggest more diverse questions, and you can check out https://github.com/vgel/repeng which is popular and easy to use - although very similar to what you wrote.

If you want to try s-space steering you can copy my code here. Happy to collaborate if you want to try it similarly.

According to AxBench, current activation-addition steering doesn't work better than an engineered prompt. But I have hope for other forms of steering that intervene on a better subspace.

I'm working on self-supervised S-space steering (AntiPaSTO), which steers via gradients in SVD weight-space rather than activation-space. It outperforms activation steering when applied to eval-awareness. Early days, but because it targets modes of behaviour in the pretrained weights rather than the activation stream, I expect it to scale better. I tested across Gemma-3 270M, 1B, 4B, and 12B and didn't see a clear scaling wall (Table 12). Now you have me curious to try larger models.

Also note that other people have novel steering e.g. CHaRS (no code), selective steering, and more. And others have looked at the limits of steering e.g..

strong "steering" of a human probably looks like OCD, or a schizophrenic delusion

I think the distinction is clear to all human readers, but I would like to make a point that this was NOT a tutorial how to steer humans, the word usually has positive connotations in normal conversations - steering looks more like teaching your child to wait for the green light instead of following the example of older kids,, or a steering committee supposedly making a more aligned decision for the whole company and not just one department..

and while the current LLM "steering" of activation vectors looks more like blunt scalpel lobotomy, the word might be used for a different technique in the future and I also have heard the word used to describe prompting towards better outcomes (aligned to objectives like writing secure software and promoting UX) - for example swearing can be useful to make GPT 5.4 or Opus 4.6 write more careful tests to make them realize they in fact made another mistake and should stop gaslighting about how they fixed everything and actually try to fix the actual problem without breaking multiple other edge cases

so I hope steering won't stop working even if "steering" might

I have almost no intuition whatsoever on why steering should be akin to OCD or schizophrenia, and I don't think you offer sufficient rationale. I believe steering resistance is downstream of self-repair, which is known to emerge (i.e. dropout).

On the other hand, strong "steering" of a human probably looks like OCD, or a schizophrenic delusion.

I think another good analogy is drugs. You can even inject drugs, like you inject steering vectors :3

If they're doing it by mistake as part of normal safety training, then I hope they figure that out before steering becomes load-bearing for Gemini's safety.

Yes, I would also like to see a more rigorous examination of better steering methods. This post was intended to point out the smoke, rather than actually fight the fire.

I don't know, the chemical signaling paths for a human brain do look a lot like steering vectors. So the human branch is just not really convincing to me.

Github if you want to check my work.

According to AxBench, current activation-addition steering doesn't work better than an engineered prompt. But I have hope for other forms of steering that intervene on a better subspace.

Also note that other people have novel steering e.g. CHaRS (no code), selective steering, and more. And others have looked at the limits of steering e.g..

strong "steering" of a human probably looks like OCD, or a schizophrenic delusion

68

Steering Might Stop Working Soon

68

Steering Humans

Steering Models

Actually Steering Models

Why now?

Beyond Steering

68

68