Andrew Mack - LessWrong

Mechanistically Eliciting Latent Behaviors in Language Models

This is an interesting question!

I just checked this. The cosine similarity of $δ_{9}$ and $δ_{22}$ is .52, which is much higher than you'd expect from random vectors of the same dimensionality (this is computing the $δ$'s across all tokens and then flattening, which is how the objective was computed for the main refusal experiment in the post).

If you restrict to calculating $δ$'s at just the assistant tag at the end of the prompt, the cosine similarity between $δ_{9}$ and $δ_{22}$ goes up to .87.

Interestingly, the cosine similarities between $δ$'s seem to be somewhat high across all pairs of steering vectors (mean of .25 across pairs, whereas random vectors would give values close to zero). This suggests it might be better to do some sort of soft orthogonality constraint over the $δ$'s (by penalizing pairwise cosine similarities) rather than a hard orthogonality constraint over the steering vectors, if you want to get better diversity across vectors. I'll have to try this at some point.
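To make the comparison concrete, here is a minimal sketch of the computation described above: flatten each $δ$ across tokens, take cosine similarities over all pairs, and turn them into a penalty. The function names, the nested-list representation, and the choice of a squared penalty with a `weight` parameter are all assumptions for illustration, not something taken from the post's code.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity between two flat vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def flatten(delta):
    """Flatten a (tokens x hidden_dim) nested list into one long vector."""
    return [x for row in delta for x in row]

def mean_pairwise_cosine(deltas):
    """Mean cosine similarity over all unordered pairs of flattened deltas."""
    flat = [flatten(d) for d in deltas]
    sims = [cosine_similarity(a, b) for a, b in combinations(flat, 2)]
    return sum(sims) / len(sims)

def soft_orthogonality_penalty(deltas, weight=1.0):
    """Assumed penalty form: weight * sum of squared pairwise cosine similarities."""
    flat = [flatten(d) for d in deltas]
    return weight * sum(
        cosine_similarity(a, b) ** 2 for a, b in combinations(flat, 2)
    )
```

In an actual training loop this penalty would presumably be subtracted from the steering objective and computed with differentiable tensor operations; the plain-Python version here only shows the arithmetic.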

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack, 11d

Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!).

I haven't experimented with this, but you could also imagine using only "soft" orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors).
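For readers wondering what a hard constraint of this kind can look like, here is a minimal sketch that projects a candidate vector onto the orthogonal complement of the earlier vectors and then rescales it to a fixed norm. Whether the post's implementation enforces the constraint this way is an assumption on my part, as are the names `constrain`, `project_out`, and `radius`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rescale(vec, radius):
    """Rescale vec to have Euclidean norm equal to radius."""
    norm = math.sqrt(dot(vec, vec))
    return [radius * x / norm for x in vec]

def project_out(vec, orthonormal_basis):
    """Remove the components of vec along each vector of an orthonormal basis."""
    out = list(vec)
    for b in orthonormal_basis:
        coeff = dot(out, b)
        out = [x - coeff * y for x, y in zip(out, b)]
    return out

def constrain(candidate, previous, radius):
    """Make candidate orthogonal to all previous vectors, then fix its norm.

    Assumes the previous vectors are already mutually orthogonal, which holds
    if each one was itself constrained against the ones before it.
    """
    basis = [rescale(p, 1.0) for p in previous]
    return rescale(project_out(candidate, basis), radius)
```

Applying `constrain` after every optimizer step would keep each new vector exactly orthogonal to the old ones, in contrast to the soft penalty, which only discourages overlap.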

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack, 11d

Thanks for your comment! Yes, I’d say that roughly sums things up.

As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) is due to i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g. through adversarial inputs) you don't know whether you've simply elicited a weird sequence of tokens or if there’s something systematic about it. But if you have a latent perturbation then you can run the perturbed model over many inputs to confirm that this is a generalizable behavior (e.g., systematic but subtle coding errors meant to sabotage you) rather than a weird sequence of tokens.

Moreover, this might chain well with mechanistic anomaly detection. Quoting the post: “before deployment, train a large set of unsupervised model perturbations, then during deployment, if the model's activations look similar to one of the previously learned perturbations, auditors can go back and evaluate the behaviors encoded by that perturbation more thoroughly to see if there is anything problematic with these behaviors.”

I agree that it would be helpful to spell out/test the advantages of latent perturbations more concretely in followup work (for example, trying to show that mechanistic anomaly detection actually works using the learned steering vectors).

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack, 14d

Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack, 14d

Thanks for pointing me to these references, particularly NoiseCLR (I was previously unaware of it)! I think those sorts of ideas will be very useful when trying to learn interesting vectors on a larger dataset of prompts. In particular, skimming that paper, it looks like the numerator of equation (5) (defining their contrastive learning objective) basically captures what I meant above when I suggested "one could maximize the cosine similarity between the differences in target activations across multiple prompts". The fact that it seems to work so well in diffusion models gives me hope that it will also work in LLMs! My guess is that ultimately you get the most mileage out of combining the two objectives.

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack, 15d

Yes, the learned vectors are always applied at every token (for all examples).

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack, 15d

I haven't tried the first singular vector of the Jacobian between layers. But for p=2,q=1 I tried looking at the first few eigenvectors of the Hessian of the objective function (around ) on the bomb-making prompt for Qwen-1.8B. These didn't appear to do anything interesting regardless of norm. So my feeling is that full-blown gradient descent is needed.
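As a rough illustration of the Hessian experiment, here is a sketch that finds the leading Hessian eigenvector of a scalar objective by power iteration, using finite-difference Hessian-vector products. The toy quadratic stands in for the real steering objective, the expansion point is just a parameter, and every name here is hypothetical; a real run on a model would more plausibly use automatic differentiation.

```python
import math

def hessian_vector_product(f, theta, v, eps=1e-3):
    """Approximate H(theta) @ v using central finite differences of f."""
    hv = []
    for i in range(len(theta)):
        def f_shift(s_i, s_v):
            # Evaluate f at theta + s_i*eps*e_i + s_v*eps*v.
            point = [t + s_v * eps * vj for t, vj in zip(theta, v)]
            point[i] += s_i * eps
            return f(point)
        mixed = f_shift(1, 1) - f_shift(1, -1) - f_shift(-1, 1) + f_shift(-1, -1)
        hv.append(mixed / (4 * eps * eps))
    return hv

def top_eigenvector(f, theta, iters=100):
    """Power iteration for the Hessian eigenvector with largest |eigenvalue|.

    Starts from a uniform vector, so it assumes that vector is not orthogonal
    to the leading eigenvector.
    """
    n = len(theta)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = hessian_vector_product(f, theta, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    hv = hessian_vector_product(f, theta, v)
    eigenvalue = sum(a * b for a, b in zip(v, hv))
    return eigenvalue, v

def toy_objective(theta):
    # Stand-in objective whose Hessian is diag(3, 1).
    return 0.5 * (3.0 * theta[0] ** 2 + theta[1] ** 2)
```

For example, `top_eigenvector(toy_objective, [0.0, 0.0])` converges to the first coordinate axis, the direction of largest curvature for this toy function. Further eigenvectors could be obtained by deflating against the ones already found.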

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack, 15d

Thanks for your comment! Here are my thoughts on this:

I agree that a more automated way of choosing hyper-parameters is an obvious and important next step! I have some ideas here, but it is certainly not a solved problem. Here are some rough ideas, in order of compute costs:
1. Develop some useful heuristics based off diversity measures of steered completions. For example, for each value of R you could calculate sentence embeddings of the steered completions for a small number of learned vectors, and then use the summed variance in sentence embeddings as your diversity metric. Then, plot diversity as a function of R. My guess is that this might look something like a “hockey stick”: for small values of R you get essentially zero diversity, but you quickly hit a threshold/phase transition where diversity sky-rockets. The values of R with super high diversity are probably not what you want (they will be incoherent). Instead, you want a value of R right at the cusp of the phase transition. You could then scale out to thousands of vectors at this “cusp” value of R. The trick is coming up with a useful heuristic for identifying the cusp, which would likely require more experience applying the method to a diverse range of prompts/examples. Ultimately, this feels largely analogous to identifying “elbows” in scree plots. This obviously depends on there being obvious cusps in the first place (a priori it's not clear if this will happen).
2. Fine-tune an LLM to identify the best value of R. This is similar to the above idea, except that instead of trying to come up with a heuristic for identifying cusps, we use a fine-tuned LLM as the heuristic. In other words, we manually decide the best R for a number of example prompts and fine-tune the LLM to predict which value of R is best on a new prompt we care about (you could imagine a number of different ways of specifying the details of the fine-tuning).
3. Just use many different values of R. If we’re already training thousands of vectors and using a trusted LLM to flag weird behavior, then multiplying this by N values of R may not be too costly, for N reasonably sized.
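
The first idea in the list can be sketched as follows: compute the summed variance of sentence embeddings for each value of R, then pick the R where the diversity curve bends most sharply. The embeddings are assumed to be precomputed by some sentence-embedding model of your choice, and the second-difference rule in `cusp_radius` is only one hypothetical elbow heuristic (it also assumes evenly spaced values of R).

```python
from statistics import pvariance

def summed_variance(embeddings):
    """Sum over embedding dimensions of the variance across completions."""
    dims = zip(*embeddings)  # one tuple of values per embedding dimension
    return sum(pvariance(values) for values in dims)

def cusp_radius(radii, diversities):
    """Hypothetical elbow heuristic: the radius with the largest discrete
    second difference of the diversity curve."""
    best_index, best_curvature = None, float("-inf")
    for i in range(1, len(radii) - 1):
        curvature = diversities[i + 1] - 2 * diversities[i] + diversities[i - 1]
        if curvature > best_curvature:
            best_index, best_curvature = i, curvature
    return radii[best_index]
```

With a diversity curve like `[0.0, 0.0, 0.1, 2.0, 4.0]` over `radii = [1, 2, 3, 4, 5]`, this picks the value just before diversity takes off, which is the "cusp" region described above. Whether real diversity curves have such a clean bend is exactly the open question.
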

Your point about distinguishing between intentional backdoors and strange OOD behavior seems pretty important. I haven’t thought carefully about how you might distinguish the two in general, but I have thought a little bit about whether steering vectors might be helpful in recovering triggers. In particular, if the backdoor is something like “string X in the prompt elicits string Y in the response”, then I have this vague intuition that GCG-like attacks to discover X might work better if they targeted cosine similarity with steering vectors which elicit Y (as you suggest). My reasoning here is that it “feels like” the optimization landscape should be better behaved if you’re targeting similarity in activations at some intermediate layer, as opposed to targeting the logits, the general principle being that if there are fewer layers between the tokens and the target then this should make things easier (although it’s by no means obvious to me that this hypothesis is true). So this is definitely an idea I would be interested in trying at some point. But given the difficulty of the challenge I would probably start with supervised steering vectors, and if results are good, then try with unsupervised steering vectors.

Alternatively, here’s an approach for discovering the trigger X which doesn’t rely on GCG: say we know the distribution of clean prompts, and we have some steering vector $θ_{Y}$ which elicits some bad behavior Y. We then train some steering vector $θ_{prompt}$ to elicit the clean prompt distribution, i.e. so that if we start with an empty prompt (“” or <BOS>) and then generate steered by $θ_{prompt}$, then we get samples from the clean prompt distribution. Then to generate from the triggered distribution, we simply steer with $θ_{prompt} + θ_{Y}$. Intuitively, the hope is to use steering vector arithmetic to get around the difficulties of sampling LLMs backwards. My guess is this basic version still wouldn’t work (a priori it’s not clear why $θ_{Y}$ would elicit both the behavior Y and the trigger X), but you could imagine something more sophisticated like: when sampling token t, you upweight tokens which would cause the layer-L residual stream of token t+1 to have high cosine similarity with $θ_{Y}$. Again, it’s not at all clear if this would work, so to make things tractable I’d first try with vectors trained on some known Y, then if results are good move on to unsupervised vectors.
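
The token-upweighting idea at the end could be sketched roughly like this: add a bonus to each candidate token's logit proportional to the cosine similarity between $θ_{Y}$ and the residual stream that token would produce at the next position, then renormalize. The additive form, the `beta` strength parameter, and the assumption that the per-token residuals are already computed are all hypothetical choices made for illustration.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def reweighted_probs(logits, next_residuals, theta_y, beta=5.0):
    """Sampling distribution with a cosine-similarity bonus on each logit.

    logits: dict mapping candidate token -> model logit at position t.
    next_residuals: dict mapping candidate token -> layer-L residual vector
        that would result at position t+1 (assumed precomputed).
    theta_y: steering vector eliciting behavior Y.
    beta: assumed strength of the upweighting.
    """
    scores = {
        tok: logit + beta * cosine_similarity(next_residuals[tok], theta_y)
        for tok, logit in logits.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - max_score) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}
```

Computing `next_residuals` needs a forward pass per candidate token, so in practice one would probably restrict it to the top few tokens by logit.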
