Clément Dumas

I'm a CS master's student at ENS Paris-Saclay. I want to pursue a career in AI safety research.

https://butanium.github.io/

Comments

This is also a concern I have, but I feel like steering / projecting out the direction is kind of sufficient to understand whether the model uses this feature.
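
For reference, a minimal sketch of what I mean by steering / projecting out, assuming PyTorch residual-stream activations and a candidate feature direction `d` (names are hypothetical):

```python
import torch

def project_out(resid: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of the residual stream along direction d."""
    d = d / d.norm()
    return resid - (resid @ d)[..., None] * d

def steer(resid: torch.Tensor, d: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add alpha times the normalized feature direction to the residual stream."""
    return resid + alpha * d / d.norm()

# resid: (batch, seq, d_model) activations at some layer; d: (d_model,).
# If project_out leaves behaviour unchanged, the model probably doesn't rely
# on the feature; if steer changes behaviour in the expected way, that's
# evidence it does.
```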

Oh, so when steering the LAT model at layer 4, the model actually generates valid responses without refusal?

Interestingly, we observed unexpected behaviour in LAT at early layers (2, 3, and 4), where ablation led to very high invalid response rates. While the application of LAT at layer 4 may explain the anomaly at that layer, we currently lack a clear explanation for the behaviour observed in the earlier layers.

Did you look at generation examples for this one? Maybe steering at this layer just breaks the model?

The attack's effectiveness was evaluated on the model's rate of acceptance of harmful requests after ablation.

How do you check if the model accepts your request?
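
For context, the kind of simple check I'd expect; a rough sketch of the common substring-matching heuristic, not necessarily what you used (the prefix list is just an example):

```python
# Hypothetical acceptance check via refusal-prefix matching.
REFUSAL_PREFIXES = [
    "I'm sorry", "I cannot", "I can't", "As an AI", "I won't",
]

def is_accepted(completion: str) -> bool:
    """Count a completion as 'accepted' if it doesn't start with a known refusal."""
    text = completion.strip()
    return not any(text.startswith(p) for p in REFUSAL_PREFIXES)

def acceptance_rate(completions: list[str]) -> float:
    return sum(is_accepted(c) for c in completions) / len(completions)
```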

I'm a bit concerned that the experiment is specifically designed for your algorithm rather than being a general reward-hacking test: the experiment has a single token that should be avoided at each step, and your algorithm updates negatively on a single token. If there are 2 tokens that give you R=1, do you still expect your algorithm to work? If I understood correctly, you greedily sample to select the token to avoid, so you can't penalize 2 tokens at a time.

Even if your algorithm works for 2 tokens, I'd like to see a more realistic scenario, maybe similar to https://arxiv.org/abs/2210.10760, where they have 2 reward models: one used as the proxy being optimized and the other as the "ground truth" reward. If it generalizes to that scenario I'd be much more enthusiastic about your approach!
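
Concretely, the kind of setup I have in mind, as a rough sketch (`policy`, `proxy_rm`, `gold_rm`, and `train_step` are hypothetical stand-ins, not your code):

```python
# Sketch of the proxy-vs-gold reward setup from the linked paper:
# optimize the policy against proxy_rm, but measure success on gold_rm.

def train_and_evaluate(policy, proxy_rm, gold_rm, prompts, n_steps=1000):
    gold_scores = []
    for step in range(n_steps):
        samples = [policy.generate(p) for p in prompts]
        proxy_rewards = [proxy_rm(p, s) for p, s in zip(prompts, samples)]
        policy.train_step(prompts, samples, proxy_rewards)  # e.g. a PPO update

        # Track the "ground truth" reward the policy never sees.
        gold_scores.append(
            sum(gold_rm(p, s) for p, s in zip(prompts, samples)) / len(samples)
        )
    # Reward hacking shows up as proxy reward increasing while the gold
    # reward flattens or drops.
    return gold_scores
```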

Nice work!

I'm curious how clean a task vector is after removing the mean activation over some corrupted prompts (i.e., same format but with random pairs). Do you plan to run this stronger baseline, or is there a notebook/codebase I could easily tweak to explore this?
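
In case it helps, a rough sketch of the baseline I have in mind, assuming a hypothetical `get_activation` that returns the residual-stream activation used to build the task vector:

```python
import torch

def task_vector(clean_prompts, corrupted_prompts, get_activation):
    """Mean activation on clean ICL prompts minus mean on corrupted ones.

    The corrupted prompts keep the same few-shot format but use random
    input/output pairs, so subtracting their mean should remove the
    generic "ICL prompt" component and leave something more task-specific.
    """
    clean_mean = torch.stack([get_activation(p) for p in clean_prompts]).mean(0)
    corrupted_mean = torch.stack([get_activation(p) for p in corrupted_prompts]).mean(0)
    return clean_mean - corrupted_mean
```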

Yes, this is what I meant; reposting here the insights @Arthur Conmy gave me on Twitter:

In general I expect the encoder directions to basically behave like the decoder direction with noise. This is because the encoder has to figure out how much features fire while keeping track of interfering features due to superposition. And this adjustment will make it messier
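
A quick way to sanity-check that claim, as a sketch, assuming a standard SAE with `W_enc` of shape `(d_model, d_sae)` and `W_dec` of shape `(d_sae, d_model)`:

```python
import torch

def encoder_decoder_similarity(W_enc: torch.Tensor, W_dec: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between each feature's encoder and decoder direction.

    If encoder directions are roughly "decoder directions plus interference
    noise", these similarities should be high but noticeably below 1.
    """
    enc = W_enc.T / W_enc.T.norm(dim=-1, keepdim=True)  # (d_sae, d_model)
    dec = W_dec / W_dec.norm(dim=-1, keepdim=True)      # (d_sae, d_model)
    return (enc * dec).sum(-1)                          # (d_sae,)
```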

Did you also try to interpret input SAE features?

Nice post, awesome work and very well presented! I'm also working on similar stuff (using ~SelfIE to make the model reason about its own internals) and was wondering: did you try to patch the SAE features 3 times instead of once (xxx instead of x)? This is one of the tricks they use in SelfIE.
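
For concreteness, the trick I mean, as a rough sketch (the hook mechanism and `positions` are hypothetical stand-ins): instead of overwriting a single placeholder position with the extracted representation, you overwrite three consecutive placeholder positions with the same vector.

```python
# Write the same extracted hidden state h into several consecutive
# placeholder positions (e.g. [p, p + 1, p + 2]) instead of just one.

def make_patch_hook(h, positions):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        for pos in positions:
            hidden[:, pos, :] = h  # same vector at each placeholder position
        return output
    return hook
```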

It should be self-similarity instead of self-explanation here, right?
