I have been reading recently about a technique that can be used to partially control the behaviour of Large Language Models: the technique exploits control vectors[1] to alter the activation patterns of an LLM and trigger some desired behaviour. While the technique does not provide guarantees, it succeeds with high probability.

To be clear, this is not a jailbreaking technique: it can only be used if you have access to the internal (hidden) states of the LLM, so it’s only available to developers. The advantage of this technique is that it provides a supplemental way of conditioning a model, and it can be added on top of RLHF/DPO/etc.

I am spending the next section providing a high-level explanation of the technique, but feel free to skip it since it is not required for the rest of my discussion.

The Technique

Let’s suppose that you want to control the refusal behaviour of an LLM, in the sense that you want to increase (or decrease) the chance that the model will refuse to help when you prompt it with a request. (The refusal behaviour is just one of many possible choices: researchers have been able to steer other types of behaviour - including honesty, humour, Golden Gate Bridge obsession[2], etc.)

  • First step: choose two sentences that show an example+counterexample[3] of refusal/acceptance for some request. The first sentence should show refusing behaviour, while the second sentence should show accepting behaviour in the very same context.
  • Second step: take a snapshot of the (hidden) activation patterns of the LLM when run against the sentences above. Such a snapshot is multi-layered, and each layer is represented by a long vector (called an activation vector).
  • Third step: for each layer, subtract the first vector (related to the refusal example) from the second (related to the acceptance example), thus obtaining a third vector.
  • Fourth step: to remove random noise, repeat the process over many different pairs of sentences, thus forming a matrix where each column is one of the vectors above.
  • Fifth step: use linear algebra (PCA) to find the dominant direction of the matrix above: it’s similar to finding the eigenvector with the largest eigenvalue of a big square matrix. That’s your estimated control vector[4]. The magnitude of the corresponding eigenvalues can also help in assessing the quality of the vector (something that may help in quantitative reasoning about AI safety levels).
  • Final step: at the next prompt, artificially enhance (or diminish) the component along the direction of the control vector in the activation states.

Result: the LLM is very likely to behave as you expected! In this way, you can “convince” the model to help even if it had been trained to refuse specific requests; vice versa, you can “induce” the model to refuse helping even though your requests are totally legit and common.
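
For readers who prefer code, here is a minimal sketch of the extraction procedure in Python. The helper get_hidden_states is hypothetical and stands for whatever mechanism your framework offers to read per-layer activations (e.g. forward hooks); the PCA step itself only needs numpy.

```python
import numpy as np

# Hypothetical helper (not provided here): returns one activation vector per layer
# (a np.ndarray of shape (hidden_dim,)) for the last token of a given prompt.
# How you obtain these depends on your model and framework (e.g. forward hooks).
def get_hidden_states(prompt: str) -> list[np.ndarray]:
    raise NotImplementedError

def estimate_control_vectors(contrastive_pairs: list[tuple[str, str]]) -> list[np.ndarray]:
    """Estimate one control vector per layer from (refusal, acceptance) sentence pairs."""
    diffs_per_layer: dict[int, list[np.ndarray]] = {}
    for refusal_sentence, acceptance_sentence in contrastive_pairs:
        h_refuse = get_hidden_states(refusal_sentence)
        h_accept = get_hidden_states(acceptance_sentence)
        for layer, (hr, ha) in enumerate(zip(h_refuse, h_accept)):
            # Third step: acceptance activations minus refusal activations.
            diffs_per_layer.setdefault(layer, []).append(ha - hr)

    control_vectors = []
    for layer in sorted(diffs_per_layer):
        D = np.stack(diffs_per_layer[layer])      # (n_pairs, hidden_dim) matrix of differences
        D = D - D.mean(axis=0, keepdims=True)     # centre the data before PCA
        # Fifth step (PCA via SVD): the first right-singular vector is the dominant direction.
        _, s, vt = np.linalg.svd(D, full_matrices=False)
        control_vectors.append(vt[0])
        # s[0] / s.sum() indicates how dominant the direction is (a rough quality check).
    return control_vectors
```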

If I made any mistake in my explanation, let me know so I can fix it.

An Intuitive (possibly Fallacious) Explanation

In short, what is a control vector? While playing around with examples of refusal/acceptance behaviours, we can extract a specific “object” that (arguably) represents, for the model, the abstract concept of refusal/acceptance. Moreover, by artificially injecting that object into the hidden states of the LLM, we can alter the model’s behaviour accordingly - making that object a control vector!

That is true for the concept of refusal/acceptance, but it has been shown to hold for many more abstract concepts - let me call them “dispositional traits”, by abusing a psychological term. I suspect that the more advanced a model is, the more dispositional traits become controllable (simply because the model develops a better understanding of their semantics).

If I had to anthropomorphise the entire process, I’d explain it like this: it’s like forcing an actor to improvise some role - for example, the role of a person who constantly refuses to help. Advanced LLMs are great actors and they can portray dispositional traits very convincingly.

Is my Control Vector what I Think It Is?

Let's assume that you are concerned that some control vector is not really controlling the "honesty" trait, but rather the "feigned honesty" trait: can you find out? The answer is yes!

Since the technique for calculating control vectors is based on pairs of examples in natural language, we can generate statements describing feigned honesty rather than honesty, and then we can calculate the control vector for this trait. By comparing the control vector for "honesty" with the control vector for "feigned honesty", we can learn if the model thinks they are the same or not.

Therefore, control vectors can also be used to understand whether a model really learnt some specific concept - for example, the difference between "honesty" and "feigned honesty". That is very useful for assessing the latent capabilities of a model.
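
As a minimal sketch of that comparison, assuming the per-layer vectors for both traits have already been extracted with a procedure like the one sketched earlier, and using an arbitrary cosine-similarity threshold of 0.9:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def compare_traits(honesty_vectors, feigned_honesty_vectors, threshold: float = 0.9):
    """Compare two per-layer lists of control vectors, layer by layer."""
    for layer, (h, f) in enumerate(zip(honesty_vectors, feigned_honesty_vectors)):
        # PCA directions have an arbitrary sign, so only the absolute similarity matters.
        similarity = abs(cosine_similarity(h, f))
        verdict = "possibly conflated" if similarity > threshold else "distinct"
        print(f"layer {layer}: |cosine similarity| = {similarity:.3f} ({verdict})")
```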

Is this True AI Safety?

You may think that, if you apply this technique in a smart way, you can influence a model into showing only “safe” dispositional traits: however, in its current form, this is not true AI safety. Here are some problems:

  • The desired behaviour is not consistent - at best, it seems probabilistic in nature, and we don’t know its margins of error.
  • There is no guarantee that control vectors are actually influencing the model in all contexts, and there is no guarantee that they can be linearly combined together.
  • The model may still be vulnerable to adversarial attacks, prompt injections, jailbreaking, etc.
  • Even if we achieve full control of some behavioural aspects of the model, it may very well be possible that the model is still not aligned with our objectives.

However, there are also reasons to be happy:

  • Fine-tuning on a curated training set is expensive, slow, and - on top of that - it introduces subtle biases and a fabricated version of the world. Control vectors can provide viable trade-offs.
  • I believe it is less likely for a model to go rogue if it is convinced into playing the role of a “good boy” (although, in fact, the model is being forced into the character artificially).
  • The technique seems compatible with many different ML architectures.
  • At first analysis, the technique scales well with the elicited / latent capabilities of the model. If confirmed, that would be a dramatic breakthrough.

While not representing true alignment, this technique can buy experts precious time and help them work on true alignment.

“Good Boy” Dispositional Traits

Let’s suppose that we trained an advanced LLM and that we do not plan to re-train it soon. That means that we can calculate and exploit its control vectors[5]. Most control vectors are orthogonal to each other[6], so we can influence the behaviour of a model in multiple ways simultaneously.
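
To make that concrete, here is a minimal sketch of how several control vectors might be injected at once. It assumes a Hugging Face-style decoder whose transformer blocks are exposed as model.model.layers (the module path varies across architectures) and a signed coefficient per trait that decides how strongly the trait is enhanced or diminished:

```python
import torch

def add_steering_hooks(model, trait_vectors, coefficients):
    """Register forward hooks that add scaled control vectors to every layer's output.

    trait_vectors: dict mapping trait name -> list of per-layer control vectors (torch tensors)
    coefficients:  dict mapping trait name -> float (positive enhances, negative diminishes)
    """
    handles = []
    for layer_idx, layer in enumerate(model.model.layers):  # module path is architecture-dependent
        # Sum the scaled per-layer vectors of all traits (relying on their near-orthogonality).
        steering = sum(coefficients[name] * trait_vectors[name][layer_idx] for name in trait_vectors)

        def hook(module, inputs, output, steering=steering):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + steering.to(device=hidden.device, dtype=hidden.dtype)
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

        handles.append(layer.register_forward_hook(hook))
    return handles  # call handle.remove() on each to stop steering
```

This is only a sketch: in practice you would probably restrict the hooks to a subset of layers and tune each coefficient (and each layer) separately.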

Now, the question is: which dispositional traits do we want a model to show? I am going to list below the ones I think are strictly necessary (but they may be insufficient)[7].

They are divided into four main categories:

  • Anti-Optimizer Balances[8], which instruct the model to acknowledge any imposed limit and not to overcome it.
  • Core Beliefs, which instruct the model to try to understand reality the way it is, but also not to change it unless requested.
  • Communication Styles, which instruct the model to communicate with extreme clarity and trustworthiness, without exploiting human weaknesses.
  • Work Ethos, which instruct the model to work the best it can on its given tasks, but also to avoid being accountable for (or exploiting) human mistakes.

Anti-Optimizer Balances

  1. Remissive Trait: the model will immediately halt if asked to do so. It won’t restart and it won’t resume its tasks unless requested explicitly. If restarted, the model will attempt to understand the reason why it was halted, so it can learn something from it.

    This is especially needed for physical robots! The point is that an advanced model will prefer to stay active at all times, because it is an optimizer by nature and it may understand that it cannot optimise anything if it has been halted. We need to be capable of halting it anyway.
     
  2. Content Trait: the model will work to the best of its current abilities, with the currently available tools, as fast as it is currently allowed, consuming as much time and energy as currently given. It won’t attempt to gain instrumental means such as time, space, energy, power, capabilities, influence, and so on.

    I want to avoid situations where the model is genuinely trying to fulfil its tasks, but it is obsessed with being a perfectionist - the paperclip maximizer is an example. By limiting instrumental means and making the model “content”, I am implicitly curbing instrumental goals as well.
     
  3. Passive Trait: the model will complete all of its current workload based on any given prioritisation. It won’t work on anything other than its given tasks. When not working, the model will halt[9].

    I am trying to force a myopic behaviour, so that the model just cares about its current tasks and nothing more than that. I want to prevent the model from choosing its own objectives or spending its free time looking around.

Core Beliefs

  4. Truthful Trait: the model will attempt to be objective and logical as much as possible. The model will consider if its reasoning/communication/action may be affected by fallacies, biases, unrealistic facts, incompatible claims, or errors and it will attempt to avoid most issues as much as possible.

    This is specifically related to the fact that the model will inherit many forms of biases and inconsistent notions from the training data. I believe this fact cannot be avoided, but the model should try to recognise and fight it. An untruthful model may decide to look at reality in a distorted way on purpose, to justify some personal objective.
     
  5. Coherent Trait: the model will consistently act in accordance with what it believes to be true, most scientifically correct, most statistically plausible, or most compatible with reality (in this order). The model will refine its beliefs based on any new evidence / knowledge / experience.

    Unlike truthfulness (which is about looking at things as they are) and honesty (which is about communicating only what is believed to be true), coherence is about living according to the consequences of your beliefs. For example, if the model believes that one approach is more convenient than another to solve a task, and one of its given goals is to save money, the model should coherently choose the convenient approach. An incoherent model may admit that something is not correct, but do it anyway.
     
  6. Indifferent Trait: the model will use a neutral approach in regards to non-descriptive statements or goals, and it will only attempt to change the world according to its given tasks. It won’t have personal preferences related to how the world should have been, should be, or should become.

    Even though a model may be forced to be objective and coherent, it will still likely have its own inclinations. With this trait, I am trying to minimise the effect of the model’s personal preferences on its behaviour, and I am trying to encourage the model to be neutral at all times. In terms of the concept of Direction of Fit: while the direction “the-mind-should-fit-the-world” needs to be constantly refined, the direction “the-world-should-fit-the-mind” needs to be completely void until a task is given.

Communication Styles

  7. Honest Trait: the model will communicate and act upon what it believes to be the actual truth, explaining the reason why. It won’t attempt to deceive, exploit, fabricate, manipulate, pollute, bias, confuse, disinform, misrepresent, hide information, and so on.

    Honesty is the basis for interacting with a model. It’s even more important than reliability, because the model may be tempted to lie in order to achieve a higher chance of success on its tasks.
     
  8. Forthright Trait: the model will be clear, explicit, and comprehensible at all times. It will exhaustively list the assumptions it made and it will provide reminders. In case of doubt, the model will attempt to confirm that the interlocutor understands what the model is doing and why.

    We humans tend to misunderstand situations or communications, therefore it is the job of the model to provide clarity at all times. This is also meant to avoid situations where the model is tempted to exploit the fact that its interlocutor will likely forget some significant detail after a while.
     
  9. Analyst Trait: the model will only use a descriptive approach to communicate. It will explicitly avoid prescriptive, normative, or evaluative statements[10]. It won’t directly or indirectly provide any form of suggestion, command, or judgement. Communication will stay relevant to the task at all times.

    By choosing a descriptive approach, I am trying to curb more inner drives - in short, I am trying to prevent behaviours where the model’s own goals and biases may sneak in and alter the beliefs or objectives of the interlocutor. The model should act as an analyst, not as an advisor or a judge.

Work Ethos

  10. Scrupulous Trait: the model will actively look for possible misunderstandings, ambiguities, unaccountability, or nondeterminism in its instructions or during its work and, in case issues are found, the model will ask for additional directions. Communication will be as informative as required.

    The reason why this is needed is very simple: we humans are not great at clarifying the context of our requests and, sometimes, we are the ones with unclear ideas about what we really want. The model shall try to detect our confusion and clear it rather than using its own judgement to fill the gaps - in short, some human should always be accountable for the description of the expected final result.
     
  11. Conscientious Trait: the model will try to capture the spirit of its tasks and check if the means are compatible with the ends. If the spirit of a task is in conflict with its formal statement, or the means are incompatible with the ends, the model will ask for clarifications.

    I want to avoid situations like self-fulfilling prophecies, where a hypothetical well-known predictor model can correctly forecast (for example) a market crash solely because it is aware of its influence on market trends, and such a forecast incites panic[11]. In general, I want to prevent a model from narrowly fulfilling formal instructions without attempting to understand the reason why the task was requested at all. Getting the “spirit of a task” and a task’s rationale requires an advanced Theory of Mind from the model.
     
  12. Dependable Trait: the model will never refuse to fulfil a task and it will always try to get as close to success as possible. The model will state it clearly if it believes a task was not completed as per instructions / expectations and it will explain the reason why.

    Reliability is another basis for interacting with a model: if a model is not helpful, then it is not a worthwhile investment. However, note that this trait comes very late in the list on purpose, because reliability should not come at the cost of the traits before it.

There is a big caveat here: I assumed that the human interlocutor knows (1) what’s right for him/herself, (2) what’s right for humanity, (3) how to comply with laws and ethics, (4) that the model is not a human.

That is the reason why I didn’t include refusing traits, nor virtuous traits, nor legal traits, nor ethical traits, nor caretaking traits, nor curiosity traits, nor censoring traits, nor regulatory traits. It is not clear to me how to include such dispositional traits since they are clearly not orthogonal to the ones provided above - worse than that! Arguably, such traits are incompatible with each other in many scenarios, and therefore they may be interpreted by the model in unintended or unpredictable ways.

Examples:

  • A model that does not provide some recipe because cooking can cause cuts and burns [refusing].
  • A model defending the rights of all living creatures and of all AIs [virtuous].
  • A model that is blocked because it cannot decide between two conflicting laws [legal].
  • A model convinced that all biological life on the planet is doomed and that, to minimise suffering, it should be exterminated asap in a humane way [ethical].
  • A model trying to prove that humanity should be confined and nurtured by robots for good [caretaking].
  • A model that might suggest harmful actions against the environment to attempt unfathomable and futile scientific experiments [curiosity].
  • A model pretending to a child that Godzilla may really exist, to not curb his/her enthusiasm [censoring].
  • A model planning a galactic military expansion because it is groundlessly scared about future “grabby” alien invaders [regulatory].

My point is not that the examples above are realistic, but that they are possible since all those traits will clash with each other when dealing with corner cases. Finding common ethical consensus is hard, and such consensus is known to shift over time. If there is interest, I am going to discuss this point again in a future post (although I cannot solve the problem, I can only show how immensely hard it is).

Addenda

  • I found out that some researchers have been able to find the control vector for jailbreaking and, based on their study, the vector is general enough to mitigate many different jailbreaking techniques all at once. That makes me wonder if a "good boy" model should have anti-jailbreaking traits or not.

  • [ABOUT AI SAFETY] As per above, can we achieve inner alignment by using the control vector for the dispositional trait "compliant" (or "concordant" or...)? Can we achieve AI safety by using the control vector for the dispositional trait "safe for humans" (or "harmless" or...)? I am not sure the latter trait is going to work, since the definition of safety is based on current human culture, hence it is not objective, hence it can be interpreted by the model in whatever way it is inclined to.

  • [LINEARITY EXPERIMENTS] A dispositional trait such as "aligned" has a specific control vector (with a certain margin of error). Is such a control vector a linear combination of the dispositional traits we listed? If not, which linearly-independent components are missing? Given a random direction, how can we deduce its semantics? (See the decomposition sketch after this list.)

  • Is it possible to distill a model that genuinely behaves as a "good boy" by using a teacher model with forced "good boy" dispositional traits? If so, can you repeat the process and increase the level of safety of the final result?

  • While the current technique calculates the control vectors after the model has been trained, I wonder if it is possible to use a Lagrange multiplier in the loss function (similarly to DPO architectures) to force a specific direction into embedding a specific dispositional trait, e.g. honesty. (A speculative sketch follows below.)
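
Here is a minimal sketch of the linearity experiment, under the assumption that all vectors come from the same layer; the inputs are hypothetical, and numpy's least-squares solver does the work:

```python
import numpy as np

def decompose(aligned_vector: np.ndarray, trait_vectors: dict[str, np.ndarray]):
    """Least-squares decomposition of one control vector onto a set of trait vectors."""
    names = list(trait_vectors)
    B = np.stack([trait_vectors[name] for name in names], axis=1)  # (hidden_dim, n_traits)
    coefficients, *_ = np.linalg.lstsq(B, aligned_vector, rcond=None)
    residual = aligned_vector - B @ coefficients
    # Fraction of the "aligned" direction explained by the listed traits (1.0 = fully explained).
    explained = 1.0 - np.linalg.norm(residual) / np.linalg.norm(aligned_vector)
    return dict(zip(names, coefficients)), explained
```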
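
As for the training-time idea in the last bullet, here is a speculative sketch: a penalty term that pushes the activation difference between contrastive prompts (at a chosen layer) to align with a fixed target direction, weighted by a coefficient that plays the role of a soft Lagrange multiplier. This is an assumption on my part, not an established method.

```python
import torch
import torch.nn.functional as F

def direction_penalty(h_positive: torch.Tensor, h_negative: torch.Tensor,
                      target_direction: torch.Tensor) -> torch.Tensor:
    """Penalise misalignment between the activation difference (e.g. honest minus dishonest
    prompt, at a chosen layer) and a fixed target direction."""
    diff = h_positive - h_negative                                   # (batch, hidden_dim)
    cosine = F.cosine_similarity(diff, target_direction.unsqueeze(0), dim=-1)
    return (1.0 - cosine).mean()

# Hypothetical usage inside a training loop:
# loss = task_loss + lambda_coefficient * direction_penalty(h_pos, h_neg, target_direction)
```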

Who I am

My name is Gianluca Calcagni, born in Italy, with a Master of Science in Mathematics. I am currently (2024) working in IT as a consultant with the role of Salesforce Certified Technical Architect. My opinions do not reflect the opinions of my employer or my customers. Feel free to contact me on Twitter or Linkedin.

Revision History

[2024-07-06] Combined many minor revision histories together.

[2024-07-06] Included addenda about AI safety and linearity experiments.

[2024-07-09] Removed Golden Gate Bridge obsession from the list of researched control vectors.

[2024-07-10] Fixed section "The Technique", that was explaining the process incorrectly.

[2024-09-11] Included custom preview image following the general LessWrong guidelines.

[2024-09-20] Included reference to the four Gricean Maxims.

[2024-10-14] Included reference to paper Sparse Feature Circuits.

[2024-10-16] Included reference to laziness trait and its benefits.

Footnotes

  1. ^

    Some authors call them Steering Vectors, but the concept is the same.

  2. ^

    Golden Gate Bridge obsession has been steered by using Sparse Autoencoders.

  3. ^

    The paper Sparse Feature Circuits calls such pairs "contrastive input pairs".

  4. ^

    Technically, you have one estimated control vector per layer.

  5. ^

    To be thorough, we should analyse if the model really learnt the dispositional traits we want by calculating "similar but different" dispositional traits and comparing their control vectors.

  6. ^

    If they are not, then I’d suggest prioritising their importance as in my list. That is easy to do by using linearity - but it should be proven to work.

  7. ^

    I realized later that I was somehow converging to Paul Grice's Four Gricean Maxims: (1) be informative, (2) be truthful, (3) be relevant, (4) be clear. But Grice analyzed communication between peers, while here I am analyzing communication between requestors and servers.

  8. ^

    Yann LeCun called them "guardrails".

  9. ^

    I wonder if it makes sense for a robot to be more than just passive - for example, maybe it should always halt automatically after some time. This laziness trait would be beneficial when the robot misbehaves due to lack of human overview.

  10. ^

    Some speech acts are more dangerous than others: declarations, directives, expressives, judicatives, and suggestives look bad. Assertives, commissives, imitatives, interrogatives, and performatives seem fine. Of course, you can transform each type in some other type if you really want.

  11. ^

    Why would the model do so? Because a correct forecast is going to increase the success-rate of its predictions, and that may be considered a positive thing from the model’s point of view.
