All of Andrew Mack's Comments + Replies

Andrew Mack

Nice post!

I agree that an important goal of MELBO is to elicit "complex many-token behaviors" (this is a current priority of mine).

You may want to check out my recent results on eliciting password-locked performance on DeepSeek-Math-7B. Using my new training algorithm for finding MELBO vectors, it's possible to find a vector which increases MATH performance from 3% to 23% on password-locked MATH.

The new algorithm is much more efficient than the sequential training procedure from the original post, to the point that I'm currently bottlenecked by inference (... (read more)

I think the relation between K-means and sparse dictionary learning (essentially, K-means is dictionary learning under an L_0=1 constraint) is already well-known in the sparse coding literature? For example, see this wiki article on K-SVD (a sparse dictionary learning algorithm), which reviews this connection before getting into the nuances of K-SVD.
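
For readers less familiar with the connection, here is a minimal NumPy sketch (illustrative only, hypothetical names) of K-means viewed as dictionary coding where each point gets exactly one active, unit-valued coefficient:

```python
import numpy as np

def kmeans_as_sparse_coding(X, D):
    """Encode rows of X with dictionary D (centroids as rows) under an
    L_0 = 1 constraint with unit coefficients: each point is 'reconstructed'
    by exactly one atom, i.e. its nearest centroid. Illustrative sketch."""
    # squared distances between every point and every centroid
    dists = ((X[:, None, :] - D[None, :, :]) ** 2).sum(-1)   # (n, k)
    assign = dists.argmin(axis=1)                            # nearest atom per point
    codes = np.zeros((X.shape[0], D.shape[0]))
    codes[np.arange(X.shape[0]), assign] = 1.0               # one-hot codes
    recon = codes @ D                                        # reconstruction = chosen centroid
    return codes, recon

# Lloyd's algorithm then alternates this coding step with a dictionary update
# (each centroid = mean of its assigned points), the same alternating structure
# as K-SVD with a different constraint on the codes.
```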

Were the SAEs for this comparison trained on multiple passes through the data, or just one pass/epoch? Because if for K-means you did multiple passes through the data but for SAEs just one, then this feels like an unfair comparison.

Regarding your first question (the multiplicity of output directions, as compared with input directions): I would say that a priori my intuition matched yours (that there should be less multiplicity in output directions), but that the empirical evidence is mixed:

Evidence for less output vs input multiplicity: In initial experiments, I found that orthogonalizing led to less stable optimization curves, and to subjectively less interpretable features. This suggests that there is less multiplicity in output directions. (And in fact my suggestion above in algorithms 2/3 is not to ortho... (read more)

This is an interesting question! 

I just checked this. The cosine similarity of  and  is .52, which is much more similar than you'd expect from random vectors of the same dimensionality (this is computing the 's across all tokens and then flattening, which is how the objective was computed for the main refusal experiment in the post).

If you restrict to calculating 's at just the assistant tag at the end of the prompt, the cosine similarity between  and  goes up to .87.

Interestingly, the cosi... (read more)
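
For concreteness, here is a rough sketch of the two ways of computing this kind of cosine similarity (the placeholder tensors stand in for the per-token activation differences at the target layer; shapes and values are made up):

```python
import torch
import torch.nn.functional as F

# Placeholder tensors: per-token activation differences at the target layer
# induced by two different learned vectors, shape (num_tokens, d_model).
# Real values would come from the model's forward passes.
delta_a = torch.randn(64, 4096)
delta_b = torch.randn(64, 4096)

# "Across all tokens and then flattening": treat the whole (tokens x d_model)
# block as one long vector, matching how the objective was computed for the
# main refusal experiment.
cos_all_tokens = F.cosine_similarity(delta_a.flatten(), delta_b.flatten(), dim=0)

# Restricting to a single position (e.g. the assistant tag at the end of the
# prompt): compare only that token's activation difference.
cos_assistant_tag = F.cosine_similarity(delta_a[-1], delta_b[-1], dim=0)
```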

gate
Why do you guys think this is happening? It sounds to me like one possibility is that the model might have some amount of ensembling (thinking back to The Clock and The Pizza, where ensembling happened in a toy setting). W.r.t. "across all steering vectors" that's pretty mysterious, but at least in the specific examples in the post even 9 was semi-fantasy.

Also, what are y'all's intuitions on picking layers for this stuff? I understand that you describe in the post that you control early layers because we suppose that they might be acting something like switches to different classes of functionality. However, implicit in the choice of layer 10, it seems like you probably don't want to go too early, because maybe in the very early layers the model is still embedding and learning basic concepts like whether a word is a noun or whatever. Do you choose layers based on experience tinkering in Jupyter notebooks and the like, or have you run some sort of grid to get a notion of what the effects elsewhere are? If the latter, it would be nice to know, to aid in hypothesis formation and the like.

Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!). 

I haven't experimented with this, but you could also imagine using only "soft" orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors). 
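
A minimal sketch of the two options (hypothetical helper names, not the exact training code): the hard constraint projects each new vector onto the orthogonal complement of the previously learned ones, while the soft version adds a pairwise cosine-similarity penalty to the loss:

```python
import torch

def project_orthogonal(theta, prev_vectors):
    """Hard constraint: remove the components of `theta` lying along
    previously learned vectors (assumed nonzero)."""
    for v in prev_vectors:
        v = v / v.norm()
        theta = theta - (theta @ v) * v
    return theta

def soft_orthogonality_penalty(theta, prev_vectors, weight=1.0):
    """Soft constraint: penalize squared cosine similarity with each
    previously learned vector instead of enforcing exact orthogonality."""
    penalty = torch.tensor(0.0, device=theta.device)
    for v in prev_vectors:
        cos = torch.dot(theta, v) / (theta.norm() * v.norm() + 1e-8)
        penalty = penalty + cos ** 2
    return weight * penalty
```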

Andrew Mack

Thanks for your comment! Yes, I’d say that roughly sums things up.

As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) is due to i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g., through adversarial inputs), you don't know whether you've simply elicited a weird sequence of t... (read more)

Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.
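
For concreteness, here is a rough, self-contained sketch of an objective of that flavor: add a norm-R vector to the residual stream at a source layer and maximize the resulting change in activations at a later target layer. A toy residual stack stands in for the transformer, and the layer indices, R, and the per-token power p are placeholder values (the exact p/q aggregation in the post may differ).

```python
import torch

# Toy stand-in for a transformer's residual stream: a stack of residual blocks.
# In a real experiment the same idea would use forward hooks on decoder layers.
d_model, n_layers, seq_len = 64, 8, 16
blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(d_model, d_model), torch.nn.GELU())
     for _ in range(n_layers)]
)

def run(x, theta=None, source_layer=2, target_layer=6):
    """Run the stack, optionally adding steering vector `theta` to the
    residual stream at `source_layer`; return activations at `target_layer`."""
    for i, block in enumerate(blocks):
        if theta is not None and i == source_layer:
            x = x + theta                                  # inject the perturbation
        x = x + block(x)                                   # residual update
        if i == target_layer:
            return x
    return x

x = torch.randn(seq_len, d_model)                          # placeholder "prompt" activations
R, p = 4.0, 2                                              # norm constraint, per-token power
theta = torch.randn(d_model, requires_grad=True)
opt = torch.optim.Adam([theta], lr=1e-2)

with torch.no_grad():
    baseline = run(x)                                      # unsteered target-layer activations

for _ in range(200):
    opt.zero_grad()
    steered = run(x, theta=R * theta / theta.norm())       # enforce ||theta|| = R
    per_token = (steered - baseline).norm(dim=-1) ** p     # change at each position
    loss = -per_token.sum()                                # maximize downstream change
    loss.backward()
    opt.step()
```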

tailcalled
Fair, its eigenvectors should be equivalent to the singular vectors of the Jacobian.

Thanks for pointing me to these references, particularly on NoiseCLR! (I was unaware of it previously). I think those sorts of ideas will be very useful when trying to learn interesting vectors on a larger data-set of prompts. In particular, skimming that paper, it looks like the numerator of equation (5) (defining their contrastive learning objective) basically captures what I meant above when I suggested "one could maximize the cosine similarity between the differences in target activations across multiple prompts". The fact that it seems to work so well... (read more)
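
As a rough illustration of that suggestion (not NoiseCLR's exact objective), one could score a candidate steering vector by how consistently it moves the target activations across prompts; the tensor shapes below are placeholders:

```python
import torch
import torch.nn.functional as F

def cross_prompt_alignment(deltas):
    """`deltas`: list of target-layer activation differences, one per prompt,
    each of shape (d_model,) (e.g. averaged over token positions).
    Returns the mean pairwise cosine similarity, to be *maximized* so that the
    same steering vector induces a consistent direction of change across
    prompts (loosely analogous to the numerator of NoiseCLR's objective)."""
    sims = []
    for i in range(len(deltas)):
        for j in range(i + 1, len(deltas)):
            sims.append(F.cosine_similarity(deltas[i], deltas[j], dim=0))
    return torch.stack(sims).mean()

# Example with placeholder activation differences for three prompts:
deltas = [torch.randn(4096) for _ in range(3)]
objective = cross_prompt_alignment(deltas)   # add (negated) to the training loss
```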

Bogdan Ionut Cirstea
TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space seems to be using a contrastive approach for steering vectors (I've only skimmed though), it might be worth having a look.

Yes, the learned vectors are always applied at every token (for all examples).

I haven't tried the first singular vector of the Jacobian between layers. But for p=2, q=1 I tried looking at the first few eigenvectors of the Hessian of the objective function (around the zero steering vector) on the bomb-making prompt for Qwen-1.8B. These didn't appear to do anything interesting regardless of norm. So my feeling is that full-blown gradient descent is needed.
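
For reference, a sketch of how such Hessian eigenvectors can be approximated without materializing the Hessian, via Hessian-vector products and power iteration (the objective callable and dimensions are placeholders):

```python
import torch

def top_hessian_eigvec(objective, theta0, iters=50):
    """Power iteration with Hessian-vector products: approximates the
    eigenvector of the Hessian of `objective` at `theta0` with the largest
    absolute eigenvalue, without forming the full Hessian. `objective`
    maps a steering vector to a scalar. Sketch only."""
    v = torch.randn_like(theta0)
    v = v / v.norm()
    for _ in range(iters):
        theta = theta0.clone().requires_grad_(True)
        (grad,) = torch.autograd.grad(objective(theta), theta, create_graph=True)
        (hv,) = torch.autograd.grad(grad @ v, theta)   # Hessian-vector product
        v = hv / hv.norm()
    return v

# Toy usage with a quadratic objective (hypothetical):
A = torch.randn(16, 16)
A = A + A.T
top_vec = top_hessian_eigvec(lambda t: t @ A @ t, torch.zeros(16))
```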

tailcalled
The singular vectors of the Jacobian between two layers seem more similar to what you're doing in the OP than the Hessian of the objective function does? Because the Hessian of the objective function sort of forces it all to be mediated by the final probabilities, which means it discounts directions in activation space that don't change the probabilities yet, but would change the probabilities if the change in activations was scaled up beyond infinitesimal.

Edit: wait, maybe I misunderstood. I assumed that by the objective function you meant some cross-entropy on the token predictions, but I guess in context it's more likely you meant the objective function for the magnitude of change in later-layer activations induced by a given activation vector?
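
For completeness, a small sketch of the Jacobian-between-layers version (toy dimensions; the layer map and sizes are placeholders):

```python
import torch

def top_jacobian_singular_vector(layer_map, x0):
    """Leading right singular vector of the Jacobian of `layer_map`
    (source-layer activations -> target-layer activations) at `x0`.
    Materializing the Jacobian is fine at toy sizes; for a real model one
    would instead run power iteration with jvp/vjp products. Sketch only."""
    J = torch.autograd.functional.jacobian(layer_map, x0)   # (d_out, d_in)
    U, S, Vh = torch.linalg.svd(J)
    return Vh[0]   # input direction with the largest first-order downstream effect

# Toy usage with a small map standing in for "between two layers":
f = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.Tanh(), torch.nn.Linear(32, 32))
v = top_jacobian_singular_vector(f, torch.randn(32))
```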

Thanks for your comment! Here are my thoughts on this:

  1. I agree that a more automated way of choosing hyper-parameters is an obvious and important next step! I have some ideas here, but it is certainly not a solved problem. Here are some rough ideas, in order of compute costs:
    1. Develop some useful heuristics based off diversity measures of steered completions. For example, for each value of R you could calculate sentence embeddings of the steered completions for a small number of learned vectors, and then use the summed variance in sentence embeddings as your
... (read more)
submarat
We attempted 1.a "diversity measure based on sentence embeddings" and found that for Llama3.2-1B the diversity appears to decay after the cusp value for R; picking R at highest average diversity was a decent heuristic for finding meaningful steering vectors. The Llama model starts to produce highly repetitive output past the cusp. We demonstrated that repetitive completions were considered similar by our chosen sentence embedding model (SentenceTransformer all-mpnet-base-v2). Using "sum of variances" vs "mean of cosine similarities" didn't seem to matter.
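
A minimal sketch of that heuristic, assuming the sentence-transformers package and the all-mpnet-base-v2 embedder mentioned above (the helper name and usage are illustrative):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-mpnet-base-v2")

def diversity_score(completions):
    """Sum of per-dimension variances of the sentence embeddings of the
    steered completions (higher = more diverse). `completions` is a list of
    strings generated with vectors trained at a single value of R.
    (Mean pairwise cosine similarity could be used instead.)"""
    emb = embedder.encode(completions)          # (n_completions, emb_dim)
    return float(np.var(emb, axis=0).sum())

# Hypothetical usage: pick the R whose steered completions are most diverse.
# completions_by_R = {2.0: [...], 4.0: [...], 8.0: [...]}
# best_R = max(completions_by_R, key=lambda R: diversity_score(completions_by_R[R]))
```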