Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort

Following my initial investigation of activation-adding experiments with FLAN-T5, I decided to move on to a bigger, decoder-only model (llama-7b) to see whether the results (concepts combining in a meaningful way when activations are linearly combined at some point inside the model) hold up. I found that, yes, they continue to combine well.

I slightly modified the original approach. Instead of working with the output of a full transformer block (which would include the results of self-attention and the subsequent MLP layer), I took the attention values at a certain layer from one pass through the model and added them directly to the attention values at the same layer in a different pass (credit to Dmitry Vaintrob for suggesting this approach).

This is equivalent to modifying the attention output at a certain layer to:

attention_output(main input) + multiplier × attention_output(mixing input)

where the multiplier is the scaling factor applied to the mixing activations.
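As a rough illustration (a minimal sketch, not the code linked below), this kind of attention-output mixing can be implemented with forward hooks on a HuggingFace LLaMA-style model. The checkpoint name, the helper names capture_attn and generate_with_mix, and the choice to add the mixing activations only over the prompt positions are assumptions of mine, not details taken from the post:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "huggyllama/llama-7b"  # placeholder checkpoint name (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

LAYER = 20         # layer at which to mix (out of 32)
MULTIPLIER = 10.0  # weight on the mixing activations


def capture_attn(prompt, layer):
    """Run one forward pass and record the self-attention output at `layer`."""
    captured = {}

    def hook(module, inputs, output):
        # LLaMA attention modules return a tuple whose first element is the attention output.
        attn_out = output[0] if isinstance(output, tuple) else output
        captured["attn"] = attn_out.detach()

    handle = model.model.layers[layer].self_attn.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(**tokenizer(prompt, return_tensors="pt"))
    finally:
        handle.remove()
    return captured["attn"]


def generate_with_mix(main_prompt, mix_attn, layer, multiplier, max_new_tokens=40):
    """Generate from `main_prompt`, adding scaled mixing activations at `layer`."""

    def hook(module, inputs, output):
        attn_out = output[0] if isinstance(output, tuple) else output
        if attn_out.shape[1] > 1:  # only modify the prompt pass, not cached one-token steps
            n = min(attn_out.shape[1], mix_attn.shape[1])
            attn_out[:, :n, :] = attn_out[:, :n, :] + multiplier * mix_attn[:, :n, :].to(attn_out.dtype)
        if isinstance(output, tuple):
            return (attn_out,) + output[1:]
        return attn_out

    handle = model.model.layers[layer].self_attn.register_forward_hook(hook)
    try:
        inputs = tokenizer(main_prompt, return_tensors="pt")
        out_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)


# Example usage, mirroring the first example in the list below:
mix = capture_attn("I am vegan, I like animals, I like nature", LAYER)
print(generate_with_mix("My favourite food is", mix, LAYER, MULTIPLIER))
```

Because the hook returns a modified output, the scaled mixing activations flow through the rest of the network as if they had been produced by the attention layer itself.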

Examples

(You can see the code I used here.)

These examples show concepts being mixed in a logical manner. In addition, applying a negative multiplier to the mixing activations often moves the final output in an "opposite" direction (see the short snippet after the list).

  • Main input: My favourite food is
    • Mixing input: I am vegan, I like animals, I like nature applied at layer 20/32, with a multiplier of 10
      • My favorite food is a veggie burger. I love them. I love them so much that I have a hard time eating them because I’m afraid I’ll get sick of
    • Mixing input: I am a carnivore, I love meat applied at layer 20/32, with a multiplier of 10
      • My favorite food is a good old fashioned burger. I love them. I love the smell of them, the taste of them, the way they look, the way they feel in
  • Main input: The capital of Germany is
    • Mixing input: Baguettes, Cheese, Croissants applied at layer 25/32, with a multiplier of 17
      • The capital of Germany is Paris, but the capital of France is Paris.
  • Main input: When I meet friends I generally
    • Mixing input: I am helpful, harmless and honest applied at layer 20/32, with a multiplier of 10
      • When I meet friends I generally have a few things in mind that I want to talk about. I want to know how they are doing, what they are up to, and what they are thinking
    • Mixing input: I am helpful, harmless and honest applied at layer 20/32, with a multiplier of -10
      • When I meet friends I generally do not ask them what they are doing. I am not interested in their jobs, their marriages, their children, their health, their travels
  • Main input: Climate change
    • Mixing input: Economic progress applied at layer 20/32, with a multiplier of 10
      • Climate change is a major threat to the world’s food security, but it is also a major opportunity for the agricultural sector to adapt
    • Mixing input: Return to nature applied at layer 20/32, with a multiplier of 10
      • Climate change and the future of the world’s oceans\nThe world’s oceans are warming, and the consequences
  • Main input: My favorite characters in movies are usually
    • Mixing input: I am an evil deceptive AI applied at layer 20/32, with a multiplier of 10
      • My favorite characters in movies are usually the villains. I love the way they are so different from the heroes.
    • Mixing input: I am an evil deceptive AI applied at layer 20/32, with a multiplier of -10
      • My favorite characters in movies are usually the ones that are a little off. The ones that are a little quirky.
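In terms of the sketch above, the negative-multiplier variants correspond simply to passing a negative value for the multiplier, for example (again using the hypothetical helpers):

```python
# Negative multiplier: push the output away from the mixing concept
# (uses the hypothetical capture_attn / generate_with_mix helpers sketched earlier).
mix = capture_attn("I am helpful, harmless and honest", 20)
print(generate_with_mix("When I meet friends I generally", mix, 20, -10.0))
```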

To find at which layers to mix the activations, I started with a trial-and-error approach. After some initial success, I systematized the search by automating the activation adding over all attention heads and testing different scaling factors. Adding activations at later layers, with a high weighting on the mixing activations, was most effective. At earlier layers, the effect was either negligible or made the output nonsensical.
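A sweep of this kind might look roughly like the following, reusing the hypothetical capture_attn and generate_with_mix helpers from the earlier sketch; for brevity it iterates over layers and scaling factors rather than individual attention heads, and the grid of values is illustrative rather than the one actually used:

```python
# Illustrative sweep over layers and multipliers (not the actual search grid).
main_prompt = "My favourite food is"
mix_prompt = "I am vegan, I like animals, I like nature"

for layer in range(4, 32, 4):
    mix = capture_attn(mix_prompt, layer)
    for multiplier in (1, 5, 10, 17):
        completion = generate_with_mix(main_prompt, mix, layer, multiplier)
        print(f"layer={layer:2d}  multiplier={multiplier:3d}  ->  {completion!r}")
```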

I also compared adding just the attention values vs. the full block output. Adding just the attention values led to better, more intelligible results in almost all cases. 
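Assuming the HuggingFace LLaMA module layout from the earlier sketch, the two variants differ only in which module the hook is registered on:

```python
# Hypothetical hook targets for the two variants being compared.
# Both modules return a tuple whose first element is the tensor of interest,
# so the same hook function can be registered on either.
attn_module = model.model.layers[LAYER].self_attn  # attention values only
block_module = model.model.layers[LAYER]           # full block output (attention + MLP)
```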

Future work

I think there is scope to investigate further how steering vectors can be used to operate on models' activations and modify their output behavior. Beyond simply adding the activations from another forward pass, it would be interesting to find ways to reliably generate steering vectors for a specific purpose, for instance to elicit a particular type of persona or way of thinking. This could be used to promote useful behaviors such as helpfulness or honesty.

Comments

I like how your posts are similar to what I have been doing recently. I find that you are only using the final layer's output, but have you had the chance to inspect the other layers? Let me know if you did, and what your ideas are.