Andrew Mack - LessWrong

Fuzzing LLMs sometimes makes them reveal their secrets

Nice post!

I agree that an important goal of MELBO is to elicit "complex many-token behaviors" (this is a current priority of mine).

You may want to check out my recent results on eliciting password-locked performance on DeepSeek-Math-7B. Using my new training algorithm for finding MELBO vectors, it's possible to find a vector which increases MATH performance from 3% to 23% on password-locked MATH.

The new algorithm is much more efficient than the sequential training procedure from the original post, to the point that I'm currently bottlenecked by inference (even when using vllm). It also comes with a decent heuristic procedure for choosing R. In my experience, if you have a data-set of at least prompts, it's pretty good at finding a bunch of vectors corresponding to the most "salient" concept in the data-set (e.g. "jailbreak vectors" when trained on harmful instructions, or "capability-increasing vectors" when trained on password-locked models).

StefanHex's Shortform

Andrew Mack3mo10

I think the relation between K-means and sparse dictionary learning (essentially, K-means is equivalent to an $L_0 = 1$ constraint) is already well known in the sparse coding literature? For example, see this wiki article on K-SVD (a sparse dictionary learning algorithm), which first reviews this connection before getting into the nuances of K-SVD.
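As a rough illustration of that connection, here is a minimal NumPy sketch in which the K-means assignment step is written as a sparse code with exactly one nonzero coefficient per data point. The function name `kmeans_as_sparse_code` and the toy data are made up for this example, and the nonzero coefficient is fixed to 1 (the plain K-means case, as opposed to a learned coefficient).

```python
import numpy as np

def kmeans_as_sparse_code(X, centroids):
    """Encode each row of X with exactly one nonzero coefficient (L_0 = 1).

    The single coefficient is fixed to 1, so the reconstruction of a point
    is simply its nearest centroid (hypothetical helper for illustration).
    """
    # Squared distances from every point to every centroid: shape (n, k).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)
    # One-hot code matrix: one nonzero entry per row.
    codes = np.zeros((X.shape[0], centroids.shape[0]))
    codes[np.arange(X.shape[0]), assign] = 1.0
    return codes, assign

# Toy data: two obvious clusters, with the "dictionary" given by two centroids.
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
codes, assign = kmeans_as_sparse_code(X, centroids)
recon = codes @ centroids  # dictionary reconstruction = nearest centroid
```

Read this way, the centroid matrix plays the role of the dictionary and the one-hot rows of `codes` are the sparse activations.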

Were the SAEs for this comparison trained on multiple passes through the data, or just one pass/epoch? If K-means got multiple passes through the data but the SAEs got just one, then this feels like an unfair comparison.

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack5mo20

Regarding your first question (multiplicity of $u_{ℓ}$'s, as compared with $v_{ℓ}$'s) - I would say that a priori my intuition matched yours (that there should be less multiplicity in output directions), but that the empirical evidence is mixed:

Evidence for less output vs input multiplicity: In initial experiments, I found that orthogonalizing $\hat{U}$ led to less stable optimization curves, and to subjectively less interpretable features. This suggests that there is less multiplicity in output directions. (And in fact my suggestion above in algorithms 2/3 is not to orthogonalize $\hat{U}$.)

Evidence for more (or at least the same) output vs input multiplicity: Taking the $\hat{U}$ from the same DCT for which I analyzed $\hat{V}$ multiplicity, and applying the same metrics for the top $240$ vectors, I get that the average of $|\langle \hat{u}_{ℓ}, \hat{u}_{ℓ'} \rangle|$ is $.25$ while the value for the $\hat{v}_{ℓ}$'s was $.36$, so that on average the output directions are less similar to each other than the input directions (with the caveat that ideally I'd do the comparison over multiple runs and compute some sort of $p$-value). Similarly, the condition number of $\hat{U}$ for that run is $27$, less than the condition number of $\hat{V}$ of $38$, so that $\hat{U}$ looks "less co-linear" than $\hat{V}$.
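For concreteness, the two metrics mentioned here could be computed along the following lines. This is only a sketch under assumptions: it treats the directions as columns of a matrix, unit-normalizes them, and averages over distinct pairs; the exact conventions used for the numbers above are not stated, and the helper name `mean_abs_offdiag_inner` and the random stand-in matrix are hypothetical.

```python
import numpy as np

def mean_abs_offdiag_inner(M):
    """Average |<m_i, m_j>| over distinct column pairs of M.

    Columns are unit-normalized first (an assumption of this sketch),
    so each entry is an absolute cosine similarity.
    """
    M = M / np.linalg.norm(M, axis=0, keepdims=True)
    G = np.abs(M.T @ M)                  # absolute Gram matrix, shape (k, k)
    off = G[~np.eye(G.shape[0], dtype=bool)]  # drop the diagonal (self-similarity)
    return float(off.mean())

# Random stand-in for a matrix of output directions (240 columns); not real data.
rng = np.random.default_rng(0)
U_demo = rng.standard_normal((512, 240))
avg_sim = mean_abs_offdiag_inner(U_demo)
cond_number = float(np.linalg.cond(U_demo))  # ratio of largest to smallest singular value
```

A lower average similarity and a lower condition number both point toward the columns being less co-linear.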

As for how to think about output directions, my guess is that at layer $t = 20$ in a $30$ -layer model, these features are not just upweighting/downweighting tokens but are doing something more abstract. I don't have any hard empirical evidence for this though.

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1y50

This is an interesting question!

I just checked this. The cosine similarity of $δ_{9}$ and $δ_{22}$ is .52, which is much more similar than you'd expect from random vectors of the same dimensionality (this is computing the $δ$'s across all tokens and then flattening, which is how the objective was computed for the main refusal experiment in the post).

If you restrict to calculating $δ$ 's at just the assistant tag at the end of the prompt, the cosine similarity between $δ_{9}$ and $δ_{22}$ goes up to .87.
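To make the two ways of comparing $δ$'s concrete, here is a small sketch. It assumes each $δ$ is stored as an array of shape `(n_tokens, d_model)`; the shapes, the random stand-in data, and the helper name `cosine` are all hypothetical. It also shows why values like these stand out: random high-dimensional vectors have cosine similarity close to zero.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity after flattening both arrays to 1-D."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for two activation differences (not real model data).
rng = np.random.default_rng(0)
n_tokens, d_model = 16, 2048  # hypothetical sizes
delta_a = rng.standard_normal((n_tokens, d_model))
delta_b = rng.standard_normal((n_tokens, d_model))

# (1) All tokens, flattened into one long vector.
sim_all_tokens = cosine(delta_a, delta_b)
# (2) Only the final position (standing in for the assistant tag).
sim_last_token = cosine(delta_a[-1], delta_b[-1])
```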

Interestingly, the cosine similarities in $δ$'s seem to be somewhat high across all pairs of steering vectors (mean of .25 across pairs, which is higher than for random vectors, whose cosine similarity will be close to zero). This suggests it might be better to do some sort of soft orthogonality constraint over the $δ$'s (by penalizing pairwise cosine similarities) rather than a hard orthogonality constraint over the steering vectors, if you want to get better diversity across vectors. I'll have to try this at some point.
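One possible form of such a soft penalty is sketched below. This is untested as a training objective and rests on assumptions: the rows of `D` are flattened $δ$'s, and the penalty is the mean squared pairwise cosine similarity (squaring is just one choice; an absolute value would also work). The name `pairwise_cosine_penalty` is made up for this example.

```python
import numpy as np

def pairwise_cosine_penalty(D):
    """Mean squared cosine similarity over distinct pairs of rows of D.

    Returns 0 when all rows are mutually orthogonal and 1 when all rows
    are parallel (hypothetical penalty for illustration).
    """
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    C = D @ D.T                           # cosine similarity matrix
    iu = np.triu_indices(C.shape[0], k=1)  # indices of distinct pairs (i < j)
    return float((C[iu] ** 2).mean())

# Mutually orthogonal rows incur no penalty; identical rows incur the maximum.
penalty_orthogonal = pairwise_cosine_penalty(np.eye(3))
penalty_parallel = pairwise_cosine_penalty(np.ones((3, 4)))
```

In training, a term like this would be subtracted from the steering objective with some weight, which would be a hyperparameter to tune.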

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1y20

Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!).

I haven't experimented with this, but you could also imagine using only "soft" orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors).
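The hard constraint can be sketched as a projection step. This is an assumption about the mechanics rather than the exact training code: each candidate vector has its components along the earlier vectors removed and is then renormalized (rescaling to a fixed norm is omitted). The random vectors stand in for optimized ones, and `project_out` is a hypothetical helper.

```python
import numpy as np

def project_out(v, previous):
    """Remove from v its components along earlier vectors, then renormalize.

    `previous` is assumed to be a list of orthonormal vectors.
    """
    for u in previous:
        v = v - (v @ u) * u
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
learned = []
for _ in range(3):
    v = rng.standard_normal(8)    # stand-in for a vector found by optimization
    v = project_out(v, learned)   # enforce orthogonality to all earlier vectors
    learned.append(v)

gram = np.array(learned) @ np.array(learned).T  # approximately the identity
```

In an actual optimization loop, a projection like this would typically be applied after every gradient step, so the vector stays in the orthogonal complement throughout training.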

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1yΩ110

Thanks for your comment! Yes, I’d say that roughly sums things up.

As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) is due to i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g. through adversarial inputs) you don't know whether you've simply elicited a weird sequence of tokens or if there’s something systematic about it. But if you have a latent perturbation then you can run the perturbed model over many inputs to confirm that this is a generalizable behavior (e.g., systematic but subtle coding errors meant to sabotage you) rather than a weird sequence of tokens.

Moreover, this might chain well with mechanistic anomaly detection. Quoting the post: “before deployment, train a large set of unsupervised model perturbations, then during deployment, if the model's activations look similar to one of the previously learned perturbations, auditors can go back and evaluate the behaviors encoded by that perturbation more thoroughly to see if there is anything problematic with these behaviors.”

I agree that it would be helpful to spell out/test the advantages of latent perturbations more concretely in followup work (for example, trying to show that mechanistic anomaly detection actually works using the learned steering vectors).

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1yΩ110

Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1yΩ242

Thanks for pointing me to these references, particularly on NoiseCLR! (I was unaware of it previously). I think those sorts of ideas will be very useful when trying to learn interesting vectors on a larger data-set of prompts. In particular, skimming that paper, it looks like the numerator of equation (5) (defining their contrastive learning objective) basically captures what I meant above when I suggested "one could maximize the cosine similarity between the differences in target activations across multiple prompts". The fact that it seems to work so well in diffusion models gives me hope that it will also work in LLMs! My guess is that ultimately you get the most mileage out of combining the two objectives.

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1yΩ330

Yes, the learned vectors are always applied at every token (for all examples).
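In array terms, applying a vector at every token for all examples amounts to a broadcast addition. A minimal sketch, assuming residual-stream activations of shape `(batch, seq, d_model)`; the shapes and the name `add_steering` are hypothetical:

```python
import numpy as np

def add_steering(resid, v):
    """Add the same vector v (shape (d_model,)) at every position of every example."""
    return resid + v  # broadcasts over the batch and sequence dimensions

resid = np.zeros((2, 3, 4))   # (batch, seq, d_model), toy sizes
v = np.arange(4.0)            # toy steering vector
steered = add_steering(resid, v)
```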

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack1yΩ110

I haven't tried the first singular vector of the Jacobian between layers. But for p=2,q=1 I tried looking at the first few eigenvectors of the Hessian of the objective function (around ) on the bomb-making prompt for Qwen-1.8B. These didn't appear to do anything interesting regardless of norm. So my feeling is that full-blown gradient descent is needed.
