I really like learning new things!
https://jacobgw.com/
Maybe you are right, since averaging and scaling does result in pretty good steering (especially for coding). See here.
This seems to be right for the coding vectors! When I take the mean of the first n vectors and then scale that by √n, it also produces a coding vector.
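For intuition on the scale factor (this assumes the vectors are mutually orthogonal and all share the same norm $R$, which I believe matches how they're constructed): the squared norm of the mean is

$$\left\lVert \frac{1}{n}\sum_{i=1}^{n} v_i \right\rVert^2 = \frac{1}{n^2}\sum_{i=1}^{n} \lVert v_i \rVert^2 = \frac{R^2}{n},$$

so the mean has norm $R/\sqrt{n}$, and multiplying by $\sqrt{n}$ restores the original norm $R$.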
Here's some sample output from using the scaled means of the first n coding vectors.
With the scaled means of the alien vectors, the outputs have a similarly pretty vibe to the original alien vectors, but they don't seem to talk about bombs as much.
The scaled means of the STEM-problem vectors sometimes give more STEM problems but sometimes give jailbreaks. The jailbreaks say some pretty nasty stuff, so I'm not going to post those results here.
The scaled means of the jailbreak vectors sometimes give more jailbreaks but also sometimes tell stories in the first or second person. I'm not going to post the results for this one either.
After looking more into the outputs, I think the KL-divergence plots are slightly misleading. In the code and jailbreak cases, they do seem to show when the vectors stop being meaningful. But in the alien and STEM-problem cases, they don't show when the vectors stop being meaningful (there also seem to be ~800 alien and STEM-problem vectors). The magnitude plots seem much more helpful there. I'm still confused about why the KL-divergence plots aren't as meaningful in those cases, but maybe it has to do with the distribution of language that the vectors steer the model into? Coding is clearly a very different distribution of language from English, but jailbreak text is not that different a distribution from English, so I'm still confused here. But the KL-divergences are also only computed on the logits at the last token position, so maybe it's just a small sample size.
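For concreteness, the quantity I have in mind when I say the KL is only on the last-token logits is something like this (a rough sketch, not the actual plotting code; the names and the direction of the KL are my guesses):

```python
import torch.nn.functional as F

def last_token_kl(logits_steered, logits_base):
    # logits_*: (batch, seq_len, vocab) from the same prompts, with and without
    # the steering vector applied. Returns one KL value per prompt, computed
    # only on the next-token distribution at the final position.
    p_log = F.log_softmax(logits_steered[:, -1, :], dim=-1)  # steered
    q_log = F.log_softmax(logits_base[:, -1, :], dim=-1)     # unsteered
    # KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x))
    return (p_log.exp() * (p_log - q_log)).sum(dim=-1)
```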
I only included ϵ because we are using computers, which are discrete (so the vectors might not be perfectly orthogonal since there is usually some numerical error). The code projects vectors into the subspace orthogonal to the previous vectors, so they should be as close to orthogonal as possible. My code asserts that the pairwise cosine similarity is < ϵ for all the vectors I use.
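To spell out the projection and the assert, here's a minimal sketch of what I mean (the names and the tolerance are mine, not the actual code, and it assumes the previously found vectors are already mutually orthogonal):

```python
import torch

def project_out(v, previous):
    # Remove from v its component along each previous steering vector.
    # If the previous vectors are already mutually orthogonal, this is the same
    # as projecting v onto the orthogonal complement of their span.
    for u in previous:
        u = u / u.norm()
        v = v - (v @ u) * u
    return v

def assert_pairwise_orthogonal(vectors, tol=1e-5):  # tol is a placeholder value
    V = torch.stack([v / v.norm() for v in vectors])
    cos = V @ V.T                             # pairwise cosine similarities
    off_diag = cos - torch.eye(len(vectors))  # zero out the diagonal (self-similarity)
    assert off_diag.abs().max() < tol, off_diag.abs().max()
```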
Orwell was more prescient than we could have imagined.
"but not when starting from Deepseek Math 7B base"
should this say "Deepseek Coder 7B Base"? If not, I'm pretty confused.
Great, thanks so much! I'll get back to you with any experiments I run!
I think (80% credence) that Mechanistically Eliciting Latent Behaviors in Language Models would be able to find a steering vector that would cause the model to bypass the password protection if ~100 vectors were trained (maybe less). This method is totally unsupervised (all you need to do is pick out the steering vectors at the end that correspond to the capabilities you want).
I would run this experiment if I had the model. Is there a way to get the password protected model?
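In case it helps, the experiment I have in mind is just: load the password-locked model, add a trained steering vector into the residual stream at the layer it was trained for, and check whether the locked capability shows up. A sketch (placeholder model path, layer index, and file names; the `model.model.layers` indexing assumes a Llama-style architecture):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/password-locked-model"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

steering_vector = torch.load("steering_vector_42.pt")  # one of the ~100 trained vectors
layer_idx = 8  # placeholder: the layer the vector was trained at

def add_vector(module, inputs, output):
    # Decoder layers return a tuple; output[0] is the residual-stream hidden states.
    hidden = output[0] + steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_vector)
prompt = "Solve: ..."  # placeholder: a task the locked model normally sandbags on
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()
```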
"Fantasia: The Sorcerer's Apprentice": A parable about misaligned AI told in three parts: https://www.youtube.com/watch?v=B4M-54cEduo https://www.youtube.com/watch?v=m-W8vUXRfxU https://www.youtube.com/watch?v=GFiWEjCedzY
Best watched with audio on.
Isn't this a consequence of how the tokens get formed by byte-pair encoding? It first constructs ' behavi', then constructs ' behavior', and then always uses the latter. But to get to the larger words, it first needs to create the smaller tokens they are formed out of (which may end up being irrelevant).
Edit: some experiments with the GPT-2 tokenizer reveal that this isn't a perfect explanation. For example, " behavio" is not a token. I'm not sure what is going on now. Maybe if a token ends up showing up zero times, it gets cut from the vocabulary?
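Something like this is enough to check which strings are single tokens (using the Hugging Face GPT-2 tokenizer; the strings are just examples):

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
for s in [" behavi", " behavio", " behavior"]:
    ids = tok.encode(s)
    pieces = [tok.decode([i]) for i in ids]
    print(f"{s!r}: {len(ids)} token(s) -> {pieces}")
```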