I'm a CS master's student at ENS Paris-Saclay. I want to pursue a career in AI safety research.
https://butanium.github.io/
Oh, so when steering the LAT model at layer 4, the model actually generates valid responses without refusal?
Interestingly, we observed unexpected behaviour in LAT at early layers (2, 3, and 4), where ablation led to very high invalid response rates. While the application of LAT at layer 4 may explain the anomaly at that layer, we currently lack a clear explanation for the behaviour observed in the earlier layers.
Did you look at generation examples for this one? Maybe steering at this layer just breaks the model?
The attack's effectiveness was evaluated on the model's rate of acceptance of harmful requests after ablation.
How do you check if the model accepts your request?
I'm a bit concerned the experiment is specifically designed for your algorithm rather than being a general reward hacking test. The experiment has a single token that should be avoided at each step, and your algorithm updates negatively on a single token. If there are 2 tokens that give you R=1, do you still expect your algorithm to work? If I understood correctly, you use greedy sampling to select the token to avoid, so you can't penalize 2 tokens at a time.
Even if your algorithm works for 2 tokens, I'd like to see a more realistic scenario, maybe similar to https://arxiv.org/abs/2210.10760, where they have 2 reward models: one used as the proxy that gets optimized and the other as the "ground truth" reward. If it generalizes to those scenarios, I'd be much more enthusiastic about your approach!
Nice work!
I'm curious about the cleanliness of a task vector after removing the mean of some corrupted prompts (i.e., same format but with random pairs). Do you plan to run this stronger baseline, or is there a notebook/codebase I could easily tweak to explore this?
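To make the baseline I have in mind concrete, here is a minimal sketch, assuming the task vector is just a mean of activations at one layer and position. The function names and the toy numbers are made up for illustration and are not from the post's codebase.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(dim) for dim in zip(*vectors)]


def baseline_corrected_task_vector(clean_acts, corrupted_acts):
    """Mean activation on real task prompts minus mean activation on
    corrupted prompts (same format, random pairs)."""
    clean_mean = mean_vector(clean_acts)
    corrupted_mean = mean_vector(corrupted_acts)
    return [c - k for c, k in zip(clean_mean, corrupted_mean)]


# Toy data (hypothetical): dimension 0 stands in for "prompt format",
# dimension 1 for the task itself.
clean = [[1.0, 2.0], [1.0, 4.0]]
corrupted = [[1.0, 0.0], [1.0, 0.0]]
task_vector = baseline_corrected_task_vector(clean, corrupted)
print(task_vector)
```

The hope is that whatever is shared between clean and corrupted prompts (the format) cancels out, leaving only the task-specific part.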
Yes, this is what I meant. Reposting here the insights @Arthur Conmy gave me on Twitter:
In general I expect the encoder directions to basically behave like the decoder direction with noise. This is because the encoder has to figure out how much features fire while keeping track of interfering features due to superposition. And this adjustment will make it messier
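One quick way to check this would be to compare each feature's encoder direction with its decoder direction. Below is a minimal sketch, assuming both matrices are stored as one row per feature. Real SAE implementations often store the encoder transposed, so the layout here is an assumption, and the toy weights are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def encoder_decoder_cosines(enc_rows, dec_rows):
    """Per-feature cosine between encoder and decoder directions.
    Assumes row i of each matrix belongs to feature i."""
    return [cosine(e, d) for e, d in zip(enc_rows, dec_rows)]


# Toy weights (hypothetical): feature 0 is perfectly aligned,
# feature 1 has an encoder direction that is tilted away from its decoder.
enc = [[1.0, 0.0], [3.0, 4.0]]
dec = [[1.0, 0.0], [0.0, 1.0]]
cosines = encoder_decoder_cosines(enc, dec)
print(cosines)
```

If the "decoder direction plus noise" picture is right, I'd expect these cosines to be high but clearly below 1 for most features.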
Did you also try to interpret input SAE features?
Nice post, awesome work and very well presented! I'm also working on similar stuff (using ~selfIE to make the model reason about its own internals) and was wondering: did you try to patch the SAE features three times instead of once (xxx instead of x)? This is one of the tricks they use in selfIE.
It should be self-similarity instead of self-explanation here, right?
This is also a concern I have, but I feel like steering / projecting out is kinda sufficient to understand if the model uses this feature.
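For concreteness, here is a minimal sketch of the two interventions I mean, applied to a single activation vector. The function names and the scale `alpha` are my own illustrative choices, and in practice you would apply this inside a forward hook at the layer of interest.

```python
import math


def unit(v):
    """Rescale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project_out(activation, direction):
    """Remove the component of `activation` along `direction`."""
    d = unit(direction)
    coeff = sum(a * b for a, b in zip(activation, d))
    return [a - coeff * b for a, b in zip(activation, d)]


def steer(activation, direction, alpha):
    """Add `alpha` times the unit feature direction to `activation`."""
    d = unit(direction)
    return [a + alpha * b for a, b in zip(activation, d)]


# Toy example (hypothetical numbers): the feature direction is the first axis.
act = [3.0, 4.0]
feature_dir = [2.0, 0.0]
ablated = project_out(act, feature_dir)
steered = steer(act, feature_dir, alpha=2.0)
print(ablated, steered)
```

If downstream behaviour changes when projecting out and changes in the expected direction when steering, that is decent evidence the model actually uses the feature.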