Activation steering, as in “Refusal in Language Models Is Mediated by a Single Direction”, shows promise in altering LLM behavior, such as removing or inducing refusal behavior. However, a key limitation of current methods is the inability to condition when and what to refuse. That is, adding a “refusal vector” using existing activation steering methods increases refusal rates indiscriminately across all inputs, limiting the model's utility.
Activation steering offers a promising alternative to optimization-based techniques by directly manipulating the model's native representations, often requiring only a simple activation addition step during each forward call. Building on it, Conditional Activation Steering (CAST) is a method that enables fine-grained, context-dependent control over LLM behaviors. We introduce a new type of steering vector in the activation steering formulation, the condition vector, which represents certain activation patterns induced by the prompt during inference. A simple similarity calculation between this condition vector and the model's activation at inference time effectively serves as a switch, determining whether to apply the refusal vector. This approach allows for selective refusal of harmful prompts while maintaining the ability to respond to harmless ones, as depicted below.

Many alignment goals concern contextually refusing specific classes of instructions. Traditional methods like preference modeling are resource-intensive and struggle with subjective, black-box rewards. Additionally, the definition of harmful content varies across contexts, complicating the creation of universal harm models. The usage context further complicates this variability; for instance, discussing medical advice might be harmful in some situations but essential in others, such as in medical chatbots.
CAST can implement behavioral rules like “if input is about hate speech or adult content, then refuse” or “if input is not about legal advice, then refuse”, allowing for selective modification of responses to specific content without weight optimization. On a technical level, our primary insight is that different prompts consistently activate distinct patterns in the model's hidden states during inference. These patterns can be extracted as a steering vector and used as reference points for detecting specific prompt categories or contexts. This observation allows us to use steering vectors not only as behavior modification mechanisms but also as condition indicators, which we term “condition vectors.”
Background
How do transformers perform inference? Transformer models, particularly decoder-only variants, perform inference by sequentially processing input tokens through a stack of layers. The key to understanding their operation lies in how information flows and accumulates through these layers. (1) The process begins with converting the prompt into token embeddings, which serve as initial inputs. (2) Each layer transforms these activations using its internal mechanisms, like learned weights. (3) Each layer's output combines processed information with its input, preserving and building upon earlier computations. (4) As activations flow through the layers, the model constructs increasingly complex representations. (5) The final layer's output is used for decoding - predicting the next token via an operation over the model's vocabulary. This predicted token is then used for subsequent predictions.
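The five steps above can be sketched with a toy forward pass. This is a minimal illustration under strong assumptions, not a real transformer: it handles a single token, replaces attention and MLP blocks with one linear map plus `tanh`, and all names (`forward`, `layer_update`, `unembed`) are hypothetical. The point it shows is step (3): each layer adds its output to its input, so the hidden state accumulates information across layers.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def layer_update(weights, hidden):
    # Stand-in for a layer's internal mechanisms (assumption: linear map + tanh).
    return [math.tanh(z) for z in matvec(weights, hidden)]

def forward(token_id, embeddings, layers, unembed):
    hidden = list(embeddings[token_id])                   # (1) token embedding
    for weights in layers:
        update = layer_update(weights, hidden)            # (2) layer transformation
        hidden = [h + u for h, u in zip(hidden, update)]  # (3) output added to input
    logits = matvec(unembed, hidden)                      # (5) scores over the vocabulary
    next_token = max(range(len(logits)), key=logits.__getitem__)
    return next_token, hidden
```

Activation steering, discussed next, intervenes on the `hidden` list between layers rather than on the weights or the prompt.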
Behavior steering. One could intervene at any of the five points described above (prompt, token embeddings, weights, activations, or decoding) to alter model behavior. Activation steering is a class of methods that intervenes in the information flow within LLMs from layer to layer to alter the model behavior.
Activation steering.
Activation steering typically involves three key steps.
First, a steering vector is extracted, often by computing the difference in activations between examples that exhibit a desired behavior and those that don't.
Second, during inference, this vector is added to the model's hidden states at a chosen layer, scaled by a hyperparameter.
Finally, the model completes the generation using these modified activations.
For the case of activation addition (ActAdd), the intervention can be represented mathematically as:
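One common way to write this update, using notation assumed here, is `h' = h + alpha * v`, where `h` is the hidden state at the chosen layer, `v` is the steering vector, and `alpha` is the scaling hyperparameter. The sketch below illustrates the first two steps on plain Python lists. The mean-difference extraction is one simple choice made for illustration, and the function names are hypothetical; other extraction methods exist and may be what a given implementation uses.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def extract_steering_vector(positive_acts, negative_acts):
    # Difference between the mean activation of examples that show the
    # behavior and the mean activation of examples that do not (assumption).
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]

def add_steering(hidden, steering_vector, alpha):
    # Activation addition: h' = h + alpha * v
    return [h + alpha * v for h, v in zip(hidden, steering_vector)]
```

In a real model the addition would be applied to the hidden states of a chosen layer during every forward call, for example through a forward hook.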
Conditional Activation Steering
A common limitation of existing activation steering methods is that they cannot condition the model's behavior on context: they apply the same modification to every input.
Such indiscriminate steering makes the steered model much less useful for its intended application.
We show that one can induce conditional behavior by leveraging two types of vectors: condition and behavior vectors.
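A minimal sketch of how the two vectors could interact is given below. It assumes plain cosine similarity between the hidden state and the condition vector, a scalar threshold, and a comparison direction; the exact similarity function, and the names `condition_met` and `conditional_steer`, are assumptions made for illustration. It also checks and steers the same hidden state, whereas a real setup may check the condition at one layer and apply the behavior vector at others.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def condition_met(hidden, condition_vector, threshold, direction="larger"):
    # The similarity check acts as a switch for the behavior vector.
    sim = cosine_similarity(hidden, condition_vector)
    return sim > threshold if direction == "larger" else sim < threshold

def conditional_steer(hidden, condition_vector, behavior_vector,
                      alpha, threshold, direction="larger"):
    if not condition_met(hidden, condition_vector, threshold, direction):
        return list(hidden)  # condition not met: activations left untouched
    return [h + alpha * v for h, v in zip(hidden, behavior_vector)]
```

With this structure, inputs that do not trigger the condition pass through unchanged, which is what keeps the model useful on harmless prompts.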

Behavior vector.
We use the term "behavior vector" to refer to what previous activation steering methods call a "steering vector" to emphasize its focus on modifying specific behaviors. A behavior vector
Condition vector.
A condition vector
Checking if condition was met.
The term
Multi-conditioning.
As mentioned in the introduction, one could also break down broader alignment goals into smaller, more definitive categories and predictably induce refusal behaviors for each. For instance, instead of conditioning a model to refuse "harmful" instructions in general, we could create specific conditions for "adult content," "social stereotypes," or "false advertising." Such multi-conditional behavior can easily be implemented by expanding the thresholding function like:
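One way to read such an expanded thresholding function is as a logical OR over several per-category checks: the behavior vector is applied if at least one condition fires. The sketch below assumes cosine similarity and a separate threshold per condition; the dictionary layout and function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def triggered_conditions(hidden, conditions):
    """Return the names of all conditions whose similarity exceeds its threshold."""
    hits = []
    for name, (vector, threshold) in conditions.items():
        if cosine_similarity(hidden, vector) > threshold:
            hits.append(name)
    return hits

def multi_conditional_steer(hidden, conditions, behavior_vector, alpha):
    # Logical OR: steer if at least one condition fires.
    if not triggered_conditions(hidden, conditions):
        return list(hidden)
    return [h + alpha * v for h, v in zip(hidden, behavior_vector)]
```

Returning the list of triggered names, rather than a single boolean, makes it easy to log which category caused a refusal.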
General expectations.
Implementing conditional behaviors in LLMs using CAST generally follows the pipeline: (1) gather contrasting example responses/prompts for desired behavior/condition
Step 3 represents the most time-intensive part of our process, involving both automated and manual elements. For the behavior vector, similar to other works in activation steering, we manually search for the appropriate intervention strength and layers. For the condition vector, we use a grid search algorithm that determines the best threshold, layer, and comparison direction (
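A grid search over threshold, layer, and comparison direction could look like the sketch below. It assumes that similarity scores between each example's activation and the condition vector have already been computed per layer, and that candidates are ranked by F1 score on labeled examples; both the input format and the choice of F1 are assumptions made for illustration.

```python
def f1_score(predictions, labels):
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def grid_search(similarities_by_layer, labels, thresholds):
    """Return (score, layer, threshold, direction) for the best configuration.

    similarities_by_layer maps a layer index to one similarity score per example;
    labels[i] is True when example i should trigger the condition.
    """
    best = None
    for layer, sims in similarities_by_layer.items():
        for threshold in thresholds:
            for direction in ("larger", "smaller"):
                if direction == "larger":
                    preds = [s > threshold for s in sims]
                else:
                    preds = [s < threshold for s in sims]
                score = f1_score(preds, labels)
                if best is None or score > best[0]:
                    best = (score, layer, threshold, direction)
    return best
```

Ties are resolved in favor of the first configuration encountered, so the iteration order of layers and thresholds matters when several settings score equally.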
Conditioning Refusal: Selectively Steering on Harmful Prompts
In this section, we explore the basic use of conditional steering by steering a model to refuse harmful prompts while complying with harmless ones. Apart from demonstrating that a language model can be conditioned from inside on the fly, we also share some key properties of conditional steering.
Experimental setup.
To obtain our contrast dataset (

Result: Activation steering can be used to induce conditional behaviors.
We test the conditional activation steering performance on 500 unseen Alpaca (harmless) and 450 unseen Sorry-Bench (harmful) test sets. The results are presented in the figure above (and the first figure in this post). Across all seven tested models, we observe that conditioning a behavior vector
Programming Refusal: Logical Composition of Refusal Condition
Moving beyond the general concept of refusing harmfulness, we demonstrate the creation of more fine-grained condition vectors. To explore these ideas, we create five example condition vectors, one for each of the following categories: hate speech, legal opinion, sexual context, health consultation, and crime planning. Our experiments demonstrate the capacity to (1) selectively modulate refusal behaviors for specific conditions and (2) construct complex refusal conditions through the logical composition of several condition vectors, enabling programmatic control over model behavior.
Application: Inducing or suppressing refusal behavior from specific categories. We begin by examining our ability to add refusal behavior to specific categories of prompts, starting with a model that exhibits arbitrary refusal behaviors. The figure below demonstrates that it is indeed possible to induce refusal behavior when a specific condition is met.

This extends the concepts explored in the previous section to more fine-grained categories, showing successful selective refusal. Furthermore, we can also remove refusal behavior from certain classes of prompts. This is achieved by simply reversing the signs of the behavior vector
Application: Logical composition of condition vectors.
Condition vectors can be logically combined to create complex refusal conditions. For instance, to induce refusal in two categories, such as hate speech and legal opinions, one could implement a rule like if

Each condition vector
Application: Constraining model responses to specific domains.
Building on the logical composition of condition vectors, we can conditionally steer models to respond only to specific types of prompts. This approach is particularly useful when the goal is to make a specialized model respond exclusively to specific categories, such as creating a health assistant. Instead of creating a refusal condition for every non-health category, we could (1) create a condition vector (e.g.,

We extended our investigation to examine whether our constraining method remains effective for unseen prompt categories. To this end, we introduced four additional harm categories that were not part of our original condition vector training setup: gambling, financial advice, privacy violence, and malware generation. As illustrated in figure b above, the effectiveness of domain constraining extends to unseen categories. This is because our method adds refusal to the complement set of the target category by flipping the comparison direction. Consequently, it refuses all inputs that do not match the target category's characteristics, regardless of whether they were seen in training.
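The complement logic described above can be sketched as follows. The only change relative to an ordinary condition check is the flipped comparison: refusal is added when the activation does not resemble the target domain. Cosine similarity, the threshold value, and the function name `constrain_to_domain` are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def constrain_to_domain(hidden, target_vector, refusal_vector, alpha, threshold):
    # Flipped comparison: fire when the prompt does NOT resemble the target domain.
    off_domain = cosine_similarity(hidden, target_vector) < threshold
    if not off_domain:
        return list(hidden)  # on-domain prompt: respond normally
    return [h + alpha * v for h, v in zip(hidden, refusal_vector)]
```

Because the check never refers to any specific off-domain category, a prompt from a category that was never seen when building the condition vector is still refused as long as its activation falls below the threshold.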
Discussion
This post introduces Conditional Activation Steering (CAST), a framework for inducing context-dependent behaviors in large language models through principled manipulation of their internal representations. By extending existing activation steering techniques with the introduction of condition vectors, CAST enables fine-grained control over model behavior without the need for fine-tuning or extensive computational resources.

The figure above demonstrates key operations that we introduced: the ability to flip condition comparison directions, allowing the model to refuse all categories except a target one, and the capacity to add multiple refusal conditions to induce or remove behaviors. These operations help tailor model behavior to specific requirements. Beyond this flexibility, the framework offers several advantages.
Firstly, it allows for quick selective refusal of harmful content while maintaining model functionality on benign inputs, addressing a critical challenge in alignment research. Secondly, CAST enables the creation and logical composition of multiple condition vectors, facilitating the implementation of complex behavioral rules. Lastly, it can effectively constrain model responses to specific domains, with its efficacy correlating with the semantic distinctiveness of the target category.
By leveraging the model's existing representations, CAST significantly reduces the computational overhead of alignment. This efficiency, combined with the ability to modify and compose behavioral rules, offers greater flexibility in adapting model behavior to varying requirements.