TurnTrout

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Sequences

Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact
Becoming Stronger

Comments

TurnTroutΩ473

In light of Anthropic's viral "Golden Gate Claude" activation engineering, I want to come back and claim the points I earned here.[1] 

I was extremely prescient in predicting the importance and power of activation engineering (then called "AVEC"). In January 2023, right after running the cheese vector as my first idea for what to do to interpret the network, and well before anyone ran LLM steering vectors... I had only seen the cheese-hiding vector work on a few mazes. Given that (seemingly) tiny amount of evidence, I immediately wrote down 60% credence that the technique would be a big deal for LLMs: 

The algebraic value-editing conjecture (AVEC). It's possible to deeply modify a range of alignment-relevant model properties, without retraining the model, via techniques as simple as "run forward passes on prompts which e.g. prompt the model to offer nice- and not-nice completions, and then take a 'niceness vector', and then add the niceness vector to future forward passes."

Alex is ambivalent about strong versions of AVEC being true. Early on in the project, he booked the following credences (with italicized updates from present information):

  1. Algebraic value editing works on Atari agents
    1. 50% 
    2. 3/4/23: updated down to 30% due to a few other "X vectors" not working for the maze agent
    3. 3/9/23: updated up to 80% based off of additional results not in this post.
  2. AVE performs at least as well as the fancier buzzsaw edit from RL vision paper
    1. 70% 
    2. 3/4/23: updated down to 40% due to realizing that the buzzsaw moves in the visual field; higher than 30% because we know something like this is possible. 
    3. 3/9/23: updated up to 60% based off of additional results.
  3. AVE can quickly ablate or modify LM values without any gradient updates
    1. 60% 
    2. 3/4/23: updated down to 35% for the same reason given in (1)
    3. 3/9/23: updated up to 65% based off of additional results and learning about related work in this vein. 

And even if (3) is true, AVE working well or deeply or reliably is another question entirely. Still...

The cheese vector was easy to find. We immediately tried the dumbest, easiest first approach. We didn't even train the network ourselves, we just used one of Langosco et al.'s nets (the first and only net we looked at). If this is the amount of work it took to (mostly) stamp out cheese-seeking, then perhaps a simple approach can stamp out e.g. deception in sophisticated models.

  1. ^

    I generally think this work (https://arxiv.org/abs/2310.08043) and the GPT-2 steering work (https://arxiv.org/abs/2308.10248) are under-cited/-credited when it comes to the blossoming field of activation engineering, and want to call that out. Please cite this work when appropriate:

    @article{turner2023activation,
      title={Activation addition: Steering language models without optimization},
      author={Turner, Alex and Thiergart, Lisa and Udell, David and Leech, Gavin and Mini, Ulisse and MacDiarmid, Monte},
      journal={arXiv preprint arXiv:2308.10248},
      year={2023}
    }
    @article{mini2023understanding,
      title={Understanding and Controlling a Maze-Solving Policy Network},
      author={Mini, Ulisse and Grietzer, Peli and Sharma, Mrinank and Meek, Austin and MacDiarmid, Monte and Turner, Alexander Matt},
      journal={arXiv preprint arXiv:2310.08043},
      year={2023}
    }
TurnTroutΩ7144

I updated some from this, thanks. My interpretation is that goal-directed shards can be activated even by a context where the AI is expected to do stuff (eg a company assistant).

What happens if you run the evaluation with Hugo without the rest of the context? Does it still strategize?

In experiment C we find that the model strategically refuses evaluation tasks, with refusal rates increasing as negative consequences from displaying capable performance are implied to be larger.

Seems like only variants 1 and 2 introduced new info, and the rest are informationally equivalent. So Opus' deception was seriously modulated by shallow properties of the description, which makes me feel less concerned (as it fits under my current picture of AI priorities as highly contextual).

TurnTroutΩ335013

The Scaling Monosemanticity paper doesn't do a good job comparing feature clamping to steering vectors. 

To better understand the benefit of using features, for a few case studies of interest, we obtained linear probes using the same positive / negative examples that we used to identify the feature, by subtracting the residual stream activity in response to the negative example(s) from the activity in response to the positive example(s). We experimented with (1) visualizing the top-activating examples for probe directions, using the same pipeline we use for our features, and (2) using these probe directions for steering.

  1. These vectors are not "linear probes" (which are generally optimized via SGD on a logistic regression task for a supervised dataset of yes/no examples), they are difference-in-means of activation vectors
    1. So call them "steering vectors"!
    2. As a side note, using actual linear probe directions tends to not steer models very well (see eg Inference Time Intervention table 3 on page 8)
  2. In my experience, steering vectors generally require averaging over at least 32 contrast pairs. Anthropic only compares to 1-3 contrast pairs, which is inappropriate.
    1. Since feature clamping needs fewer prompts for some tasks, that is a real benefit, but you have to amortize that benefit over the huge SAE effort needed to find those features. 
    2. Also note that you can generate synthetic data for the steering vectors using an LLM, it isn't too hard.
    3. For steering on a single task, then, steering vectors still win out in terms of amortized sample complexity (assuming the steering vectors are effective given ~32/128/256 contrast pairs, which I doubt will always be true) 

In all cases, we were unable to interpret the probe directions from their activating examples. In most cases (with a few exceptions) we were unable to adjust the model’s behavior in the expected way by adding perturbations along the probe directions, even in cases where feature steering was successful (see this appendix for more details).

...

We note that these negative results do not imply that linear probes are not useful in general. Rather, they suggest that, in the “few-shot” prompting regime, they are less interpretable and effective for model steering than dictionary learning features.

I totally expect feature clamping to still win out in a bunch of comparisons, it's really cool, but Anthropic's actual comparisons don't seem good and predictably underrate steering vectors.

The fact that the Anthropic paper gets the comparison (and especially terminology) meaningfully wrong makes me more wary of their results going forwards.

TurnTroutΩ81010

If that were true, I'd expect the reactions to a subsequent LLAMA3 weight orthogonalization jailbreak to be more like "yawn we already have better stuff" and not "oh cool, this is quite effective!" Seems to me from reception that this is letting people either do new things or do it faster, but maybe you have a concrete counter-consideration here?

TurnTroutΩ330

When we then run the model on harmless prompts, we intervene such that the expression of the "refusal direction" is set to the average expression on harmful prompts:

Note that the average projection measurement and the intervention are performed only at layer , the layer at which the best "refusal direction"  was extracted from.

Was it substantially less effective to instead use 

?

We find this result unsurprising and implied by prior work, but include it for completeness. For example, Zou et al. 2023 showed that adding a harmfulness direction led to an 8 percentage point increase in refusal on harmless prompts in Vicuna 13B. 

I do want to note that your boost in refusals seems absolutely huge, well beyond 8%? I am somewhat surprised by how huge your boost is.

using this direction to intervene on model activations to steer the model towards or away from the concept (Burns et al. 2022

Burns et al. do activation engineering? I thought the CCS paper didn't involve that. 

TurnTroutΩ781

Because fine-tuning can be a pain and expensive? But you can probably do this quite quickly and painlessly. 

If you want to say finetuning is better than this, or (more relevantly) finetuning + this, can you provide some evidence?

TurnTroutΩ6107

I would definitely like to see quantification of the degree to which MELBO elicits natural, preexisting behaviors. One challenge in the literature is: you might hope to see if a network "knows" a fact by optimizing a prompt input to produce that fact as an output. However, even randomly initialized networks can be made to output those facts, so "just optimize an embedded prompt using gradient descent" is too expressive. 

One of my hopes here is that the large majority of the steered behaviors are in fact natural. One reason for hope is that we aren't optimizing to any behavior in particular, we just optimize for L2 distance and the behavior is a side effect. Furthermore, MELBO finding the backdoored behaviors (which we literally taught the model to do in narrow situations) is positive evidence.

If MELBO does elicit natural behaviors (as I suspect it does), that would be quite useful for training, eval, and red-teaming purposes.

TurnTroutΩ24555

A semi-formalization of shard theory. I think that there is a surprisingly deep link between "the AIs which can be manipulated using steering vectors" and "policies which are made of shards."[1] In particular, here is a candidate definition of a shard theoretic policy:

A policy has shards if it implements at least two "motivational circuits" (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).

By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time! The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction):

On the other hand, AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other. It's just maximizing one utility function. A mesa optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion. 

  • This definition also makes obvious the fact that "shards" are a matter of implementation, not of behavior.
  • It also captures the fact that "shard" definitions are somewhat subjective. In one moment, I might model someone is having a separate "ice cream shard" and "cookie shard", but in another situation I might choose to model those two circuits as a larger "sweet food shard."

So I think this captures something important. However, it leaves a few things to be desired:

  • What, exactly, is a "motivational circuit"? Obvious definitions seem to include every neural network with nonconstant outputs.
  • Demanding a compositional representation is unrealistic since it ignores superposition. If  dimensions are compositional, then they must be pairwise orthogonal. Then a transformer can only have  shards, which seems obviously wrong and false.

 That said, I still find this definition useful.

 I came up with this last summer, but never got around to posting it. Hopefully this is better than nothing.

  1. ^

    Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.

TurnTroutΩ350

the hope is that by "nudging" the model at an early layer, we can activate one of the many latent behaviors residing within the LLM.

In the language of shard theory: "the hope is that shards activate based on feature directions in early layers. By adding in these directions, the corresponding shards activate different behaviors in the model." 

Load More