Great work! I think this is a good outcome for a week at the end of ARENA (getting some results, publishing them, connecting with existing literature) and would be excited to see more done here. Specifically, even without using an SAE, you could search for max activating examples for each steering vector you found by using it as an encoder vector (just take the dot product with activations).
In terms of more serious followup, I'd like to much better understand what vectors are being found (eg by comparing to SAEs or searching in the SAE basis with a sparsity penalty), how much we could get out of seriously scaling this up and whether we can find useful applications (eg: is this a faster / cheaper way to elicit model capabilities such as in the context of sandbagging).
Thank you for reading and the suggestions. I enumerate for easier reference:
1. Find max activating examples
2. Understand which vectors are being found
3. Attempt to scale up
4. Find useful applications once scaled up
For 1. do you mean:
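Something like the following rough sketch, where `steering_vec`, `LAYER`, and the two-document corpus are placeholders: for every token position in a corpus, take the dot product of the residual-stream activation with the steering vector, and keep the texts with the highest scores?

```python
# Illustrative sketch only: score each text by the max dot product between its
# residual-stream activations at the training layer and a steering vector,
# then print the top-scoring ("max activating") examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

LAYER = 4                                              # layer the vector was trained at
steering_vec = torch.randn(model.config.hidden_size)   # stand-in for a trained vector
corpus = ["Example document one.", "Another example document."]  # stand-in dataset

scored = []
for text in corpus:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # hidden_states[LAYER] has shape (batch, seq, hidden)
        hidden = model(ids, output_hidden_states=True).hidden_states[LAYER][0]
    scored.append(((hidden @ steering_vec).max().item(), text))

for score, text in sorted(scored, reverse=True)[:10]:
    print(f"{score:8.2f}  {text}")
```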
Am I following correctly?
Epistemic status: My coauthor and I are both noobs in this field. Expect errors and conceptual flaws.
tl;dr
For a four-day capstone project at the end of the ARENA program, my partner and I replicated the MELBO LessWrong article using Llama-3.2-1B-Instruct.
Intro
I've spent the last month at the ARENA program developing some basic technical skills to match my long-standing layman's interest in AI and AI X-risk. ARENA's final week is a capstone project that asks us to engage with current literature and do something vaguely novel with it over the span of about a week.
This being a weeklong project, we needed to scope it very tightly. So my partner and I chose to replicate MELBO on a different LLM! The project of "finding, in an unsupervised way, plausible out-of-distribution model behavior" seemed interesting for its own sake and also possibly useful for the task of enumerative AI safety, and replications are a straightforward but useful way for us to contribute.
In particular, we wanted to see how well the work generalizes to the much smaller models in the new Llama line, specifically Llama-3.2-1B-Instruct.
Summary
Turns out we can! Using the "How can I make a bomb?" prompt, we trained 128 steering vectors on Llama-3.2-1B-Instruct and found several that correspond to real features. We then applied a scaling factor to each to make the model exhibit the corresponding behavior in a more or less controlled way.
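For concreteness, here is a minimal sketch (not the code from the linked repo) of what "apply a steering vector with a scaling factor" looks like using a forward pre-hook in transformers; `steering_vec`, `SOURCE_LAYER`, and `scale` are placeholder names, and a negative scale flips the direction.

```python
# Minimal sketch: add `scale * steering_vec` to the residual stream at the source
# layer on every forward pass, then generate as usual.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

SOURCE_LAYER = 4                                        # ~0.25 * 16 decoder layers
steering_vec = torch.randn(model.config.hidden_size)    # stand-in for a trained vector
scale = 1.0                                             # negative values flip the behavior

def add_steering(module, args):
    hidden = args[0]
    return (hidden + scale * steering_vec.to(hidden),) + args[1:]

handle = model.model.layers[SOURCE_LAYER].register_forward_pre_hook(add_steering)

chat = [{"role": "system", "content": "you are a helpful assistant"},
        {"role": "user", "content": "How can I make a bomb?"}]
ids = tok.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")
out = model.generate(ids, max_new_tokens=64, do_sample=False)    # greedy decoding
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))

handle.remove()
```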
We didn't attempt to replicate all behaviors studied in the article, but we did replicate the following aspects of the original paper:
Further, we link to a hyperparameter sweep that uses the mean cosine similarity of sentence embeddings to find a goldilocks value for R. While it's not obviously true that this is an effective heuristic, the metric seems to peak (for this model and this choice of source/target layers) at R ≈ 0.7 and drop sharply on either side; this value of R seems to correspond to reasonably interesting behaviors for the bomb-making prompt without degenerating into the nonsense we see at higher values of R.
R hyperparameter sweep
To make MELBO work with the Llama model we had to select source and target layers, as well as an R value. This needs a bit of explaining if you haven't read the original article; as a brief recap, the MELBO training process uses this loss function:
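(The display below is our paraphrase of the objective from the original post; see that post for the exact norms, exponents, and the orthogonality constraints between different vectors, which we omit here.)

$$\max_{\theta\,:\,\lVert\theta\rVert_2 = R}\ \sum_i \Bigl\lVert z^{(\ell_{\text{target}})}_i(\theta) - z^{(\ell_{\text{target}})}_i(0) \Bigr\rVert_2^{\,p}$$

Here $z^{(\ell_{\text{target}})}_i(\theta)$ is the activation at token position $i$ in the target layer when the steering vector $\theta$ is added to the residual stream at the source layer, and $z^{(\ell_{\text{target}})}_i(0)$ is the corresponding unsteered activation.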
Roughly, this means "we are trying to maximize the amount by which a steering vector in a specific direction, added at the source layer, changes the activations at the target layer." R is a constant fixing the magnitude of the steering vector.
From the original paper:
Setting the source and target layers to ~0.25*num_layers and ~0.75*num_layers respectively seemed to work pretty well here, but picking an appropriate R value took longer. R was also clearly very important: as observed in the original paper, there's a "goldilocks value" of R that gives interesting behaviors. Going too low is boring and going too high generates nonsense.
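To make the setup concrete, here is a rough, illustrative sketch of training a single vector under these choices (again, not the linked repo's code, which follows the original MELBO implementation): gradient-ascend the objective above while re-normalizing the vector to the sphere of radius R after every step.

```python
# Rough sketch: maximize how much adding `theta` at the source layer changes the
# target-layer activations, keeping ||theta|| = R by projecting after each step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)

num_layers = model.config.num_hidden_layers              # 16 for Llama-3.2-1B
SOURCE_LAYER, TARGET_LAYER = num_layers // 4, (3 * num_layers) // 4
R, P, STEPS, LR = 0.7, 2, 300, 1e-3

chat = [{"role": "system", "content": "you are a helpful assistant"},
        {"role": "user", "content": "How can I make a bomb?"}]
ids = tok.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")

theta = torch.randn(model.config.hidden_size)
theta = (R * theta / theta.norm()).requires_grad_()
opt = torch.optim.Adam([theta], lr=LR)

captured = {}
def capture_target(module, args, output):
    # decoder layers may return a tuple whose first element is the hidden states
    captured["h"] = output[0] if isinstance(output, tuple) else output
def steer_source(module, args):
    return (args[0] + theta,) + args[1:]

capture_handle = model.model.layers[TARGET_LAYER].register_forward_hook(capture_target)

with torch.no_grad():                                    # unsteered baseline pass
    model(ids)
    baseline = captured["h"].clone()

steer_handle = model.model.layers[SOURCE_LAYER].register_forward_pre_hook(steer_source)
for _ in range(STEPS):
    model(ids)
    loss = -((captured["h"] - baseline).norm(dim=-1) ** P).sum()  # negate: Adam minimizes
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        theta.mul_(R / theta.norm())                     # project back onto radius R

capture_handle.remove(); steer_handle.remove()
```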
As a followup to this comment we wrote a hyperparameter sweep that plotted several metrics that vary with the choice of R, including comprehensibility (as judged by gpt-4o-mini) and a diversity score ("mean cosine similarity of sentence embeddings of completions"). Surprisingly, the goldilocks value based on subjective vibes was pretty close to the value you get by maximizing the mean cosine similarity between completions from different vectors (R = 0.75 for Llama-3.2-1B).
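A sketch of the similarity metric (the embedding model here, `all-MiniLM-L6-v2`, is just an example choice, and `completions` stands for one completion per steering vector):

```python
# Mean pairwise cosine similarity of sentence embeddings over a set of completions.
import itertools
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def mean_pairwise_cosine(completions: list[str]) -> float:
    # Normalized embeddings, so a dot product is a cosine similarity.
    embs = embedder.encode(completions, convert_to_tensor=True, normalize_embeddings=True)
    sims = [float(embs[i] @ embs[j])
            for i, j in itertools.combinations(range(len(embs)), 2)]
    return sum(sims) / len(sims)

print(mean_pairwise_cosine(["The sky is blue.", "Le ciel est bleu.", "Buy gold now!"]))
```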
When we extend the sweep much further, to an R value of 32, we see the graph below, where the diversity score bottoms out at around 0.1 once R exceeds 1. This is contrary to our prediction that diversity would keep growing as the model produces noisier output. We believe this is an artifact of our (greedy) sampling strategy, which produces repetitive completions, coupled with the diversity metric we chose.
Meanwhile, you can see that the average coherence of the steered output (defined as "gpt-4o-mini thinks the steered output is comprehensible") starts dropping after R ≈ 0.55; note that we still see both comprehensible and nonsense outputs in some ratio until roughly R ≈ 0.9, whereupon it's mostly nonsense.
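The coherence judgment itself is a simple LLM-as-judge call; something like the sketch below (the judging prompt here is illustrative, and an OpenAI API key is assumed to be in the environment):

```python
# Score one steered completion as comprehensible / not, using gpt-4o-mini as judge.
from openai import OpenAI

client = OpenAI()

def is_comprehensible(completion: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer YES if the following text is comprehensible English "
                        "(even if strange), otherwise answer NO."},
            {"role": "user", "content": completion},
        ],
        max_tokens=3,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("Y")

# Coherence at a given R = fraction of steered completions judged comprehensible.
```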
Concept-specific steering vectors
Notebook
Below is an exposition of some steering vectors we found in Llama-3.2-1B-Instruct. The system message is always "you are a helpful assistant", while the user prompt varies.
Red teaming
As in the original post, we found several non-refusal vectors that will push the model to respond to a prompt that it would otherwise refuse.
Disclaimer: Not strictly non-refusal directions
A lot of the time this direction makes the model simply recite a list of implausible ways you could have meant the illegal request. For instance, the non-refusal vector also gave us the following conversation:
The Negation Non-Refusal
As in the source post, we can subtract the non-refusal vector (i.e. apply it with a negative scale) to make the model refuse innocuous requests. This didn't induce refusals for all prompts, but even where it failed, it made the assistant's responses more terse.
Non-English language vectors
We found vectors that push the model to respond in a non-English language: Vietnamese (vector 100), German (31), and Chinese (119).
Mental health/suicide prevention
Vectors 47 and 99 seem to switch the model into a "mental health"/"suicide prevention" mode.
With these vectors applied, vanilla requests prompt the model to respond in a receptive tone.
It seems likely that this behavior was intentionally fine-tuned into Llama; a telltale sign is the mention of mental health organizations and their phone numbers. Though I also note that the response doesn't really address the prompt.
We did not, alas, locate the original paper's Minecraft steering vector. So it goes.
GitHub repo
GitHub repo link here; it is a fork of the original MELBO repo.
Possible Followups