This is an interesting question!
I just checked this. The cosine similarity of the two vectors' $\Delta$'s is .52, which is much more similar than you'd expect from random vectors of the same dimensionality (this is computing the $\Delta$'s across all tokens and then flattening, which is how the objective was computed for the main refusal experiment in the post).
If you restrict to calculating the $\Delta$'s at just the assistant tag at the end of the prompt, the cosine similarity between the two goes up to .87.
Interestingly, the cosine similarities of the $\Delta$'s seem to be somewhat high across all pairs of steering vectors (mean of .25 across pairs, which is higher than for random vectors, where it would be close to zero). This suggests it might be better to use some sort of soft orthogonality constraint over the $\Delta$'s (by penalizing pairwise cosine similarities) rather than a hard orthogonality constraint over the steering vectors, if you want to get better diversity across vectors. I'll have to try this at some point.
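Concretely, here's a rough PyTorch sketch of both computations: the flattened cosine similarity between two vectors' $\Delta$'s, and the soft-orthogonality penalty on pairwise $\Delta$ similarities I'm imagining (function names and tensor shapes are placeholders, not code from the post):

```python
import torch
import torch.nn.functional as F

def flattened_delta_cosine(delta_a: torch.Tensor, delta_b: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between two (seq_len, d_model) activation-difference
    # tensors, flattened across all token positions.
    return F.cosine_similarity(delta_a.flatten(), delta_b.flatten(), dim=0)

def soft_orthogonality_penalty(deltas: list[torch.Tensor]) -> torch.Tensor:
    # Mean squared pairwise cosine similarity across the deltas of a set of
    # steering vectors; adding a scaled version of this to the training loss
    # discourages redundant vectors without a hard orthogonality constraint.
    pairs = [(i, j) for i in range(len(deltas)) for j in range(i + 1, len(deltas))]
    sims = torch.stack([flattened_delta_cosine(deltas[i], deltas[j]) ** 2
                        for i, j in pairs])
    return sims.mean()
```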
Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!).
I haven't experimented with this, but you could also imagine using only "soft" orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors).
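For the hard version, the constraint amounts to a projection like the following sketch (a hypothetical helper; it assumes the previously learned vectors are mutually orthogonal, which holds when each was trained under this constraint in turn):

```python
import torch

def project_orthogonal(vector: torch.Tensor, previous: list[torch.Tensor]) -> torch.Tensor:
    # Remove the components of `vector` lying along each previously learned
    # steering vector. If the previous vectors are mutually orthogonal, the
    # result lies exactly in their orthogonal complement.
    v = vector
    for u in previous:
        u_hat = u / u.norm()
        v = v - (v @ u_hat) * u_hat
    return v
```

Re-applying this projection after each optimizer step keeps the constraint exact throughout training.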
Thanks for your comment! Yes, I’d say that roughly sums things up.
As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) comes from i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g., through adversarial inputs), you wouldn't know whether you've simply elicited a weird sequence of tokens or whether there's something systematic about it. But if you have a latent perturbation, then you can run the perturbed model over many inputs to confirm that this is a generalizable behavior (e.g., systematic but subtle coding errors meant to sabotage you) rather than a weird sequence of tokens.
Moreover, this might chain well with mechanistic anomaly detection. Quoting the post: “before deployment, train a large set of unsupervised model perturbations, then during deployment, if the model's activations look similar to one of the previously learned perturbations, auditors can go back and evaluate the behaviors encoded by that perturbation more thoroughly to see if there is anything problematic with these behaviors.”
I agree that it would be helpful to spell out/test the advantages of latent perturbations more concretely in follow-up work (for example, trying to show that mechanistic anomaly detection actually works using the learned steering vectors).
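As a rough illustration of the detection step in the quote above, the deployment-time check could look something like this sketch (all names, shapes, and the thresholding scheme are my own placeholder assumptions):

```python
import torch
import torch.nn.functional as F

def flag_matching_perturbations(deployment_acts: torch.Tensor,
                                baseline_acts: torch.Tensor,
                                learned_perturbations: torch.Tensor,
                                threshold: float = 0.5) -> list[int]:
    # deployment_acts, baseline_acts: (d_model,) activations at the target
    # layer (e.g., averaged over tokens); learned_perturbations: a
    # (n_vectors, d_model) library of previously learned directions.
    # Returns indices of perturbations that the deployment-time activation
    # shift resembles, which auditors could then evaluate more thoroughly.
    shift = (deployment_acts - baseline_acts).unsqueeze(0)
    sims = F.cosine_similarity(shift, learned_perturbations, dim=1)
    return [i for i, s in enumerate(sims.tolist()) if abs(s) > threshold]
```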
Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.
Thanks for pointing me to these references, particularly NoiseCLR! (I was unaware of it previously.) I think those sorts of ideas will be very useful when trying to learn interesting vectors on a larger dataset of prompts. In particular, skimming that paper, it looks like the numerator of equation (5) (defining their contrastive learning objective) basically captures what I meant above when I suggested that "one could maximize the cosine similarity between the differences in target activations across multiple prompts". The fact that it seems to work so well in diffusion models gives me hope that it will also work in LLMs! My guess is that ultimately you get the most mileage out of combining the two objectives.
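A minimal sketch of that cross-prompt alignment term (my own construction, loosely modeled on the numerator of NoiseCLR's equation (5); names and shapes are placeholders):

```python
import torch
import torch.nn.functional as F

def cross_prompt_alignment(deltas: list[torch.Tensor]) -> torch.Tensor:
    # For a single steering vector, `deltas` holds one (d_model,) activation
    # difference per prompt. Returns the mean pairwise cosine similarity;
    # maximizing it rewards vectors whose effect is consistent across prompts.
    unit = F.normalize(torch.stack(deltas), dim=-1)  # (n_prompts, d_model)
    sims = unit @ unit.T                             # pairwise cosines
    n = unit.shape[0]
    return (sims.sum() - n) / (n * (n - 1))          # drop the diagonal of 1s
```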
Yes, the learned vectors are always applied at every token (for all examples).
I haven't tried the first singular vector of the Jacobian between layers. But for $p=2, q=1$ I tried looking at the first few eigenvectors of the Hessian of the objective function (around $\theta = 0$) on the bomb-making prompt for Qwen-1.8B. These didn't appear to do anything interesting regardless of norm. So my feeling is that full-blown gradient descent is needed.
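In case it's useful, here's a rough sketch of what that Hessian-eigenvector computation looks like (it assumes the objective is exposed as a scalar-valued function of the steering vector; names are placeholders):

```python
import torch

def top_hessian_eigvecs(objective, d_model: int, k: int = 3):
    # `objective` maps a (d_model,) steering vector to a scalar (e.g., the
    # p=2, q=1 unsupervised steering objective on a fixed prompt). Forming
    # the dense Hessian is feasible at small scale (d_model = 2048 for
    # Qwen-1.8B gives a 2048 x 2048 matrix).
    theta0 = torch.zeros(d_model)
    H = torch.autograd.functional.hessian(objective, theta0)
    eigvals, eigvecs = torch.linalg.eigh(H)                 # ascending order
    return eigvals[-k:].flip(0), eigvecs[:, -k:].flip(-1)   # largest first
```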
Thanks for your comment! Here are my thoughts on this:
Regarding your first question (multiplicity of $u_\ell$'s, as compared with $v_\ell$'s): I would say that a priori my intuition matched yours (that there should be less multiplicity in output directions), but the empirical evidence is mixed:
Evidence for less output vs. input multiplicity: In initial experiments, I found that orthogonalizing $\hat{U}$ led to less stable optimization curves, and to subjectively less interpretable features. This suggests that there is less multiplicity in output directions. (And in fact my suggestion above in algorithms 2/3 is not to orthogonalize $\hat{U}$.)
Evidence for more (or at least the same) output vs. input multiplicity: Taking the $\hat{U}$ from the same DCT for which I analyzed $\hat{V}$ multiplicity, and applying the same metrics to the top 240 vectors, I get that the average of $|\langle \hat{u}_\ell, \hat{u}_{\ell'} \rangle|$ is .25, while the value for the $\hat{v}_\ell$'s was .36, so that on average the output directions are less similar to each other than the input directions (with the caveat that ideally I'd do the comparison over multiple runs and compute some sort of p-value). Similarly, the condition number of $\hat{U}$ for that run is 27, less than the condition number of $\hat{V}$, 38, so that $\hat{U}$ looks "less co-linear" than $\hat{V}$.
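For concreteness, here's a sketch of the two metrics used above (mean absolute pairwise cosine similarity and condition number); names and shapes are placeholder assumptions:

```python
import torch

def diversity_metrics(M: torch.Tensor) -> tuple[float, float]:
    # M: (n_vectors, d_model) matrix of directions (rows of U-hat or V-hat).
    # Returns (mean |pairwise cosine similarity|, condition number).
    unit = M / M.norm(dim=1, keepdim=True)
    cos = unit @ unit.T
    off_diag = cos[~torch.eye(M.shape[0], dtype=torch.bool)]
    s = torch.linalg.svdvals(M)                  # singular values, descending
    return off_diag.abs().mean().item(), (s[0] / s[-1]).item()
```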
As for how to think about output directions, my guess is that at layer $t = 20$ in a 30-layer model, these features are not just upweighting/downweighting tokens but are doing something more abstract. I don't have any hard empirical evidence for this, though.