Just asked Claude for thoughts on some technical mech interp paper I'm writing. The difference with and without the non-sycophantic prompt is dramatic (even with extended thinking):
Me: What do you think of this bullet point structure for this section based on those results?
Claude (normal): I think your proposed structure makes a lot of sense based on the figures and table. Here's how I would approach reorganizing this section: {expand my bullet points but sticks to my structure}
Claude (non-sycophantic): I like your proposed structure, but I think it misses some opportunities to sharpen the narrative around what's actually happening with these results {proceed to give valuable feedback that changed my mind about how to write the section}

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Clément Dumas2mo21

I like this idea! Looking forward to your progress 👀

Activation space interpretability may be doomed

Clément Dumas3mo40

This is also a concern I have but I feel like steering / project out is kinda sufficient to understand if the model uses this feature.

Latent Adversarial Training (LAT) Improves the Representation of Refusal

Clément Dumas4mo10

Oh so when steering the LAT model at layer 4, the model actually generate valid response without refusal?

Latent Adversarial Training (LAT) Improves the Representation of Refusal

Clément Dumas4mo20

Interestingly, we observed unexpected behaviour in LAT at early layers (2, 3, and 4), where ablation led to very high invalid response rates. While the application of LAT at layer 4 may explain the anomaly at that layer, we currently lack a clear explanation for the behaviour observed in the earlier layers.

Did you look at generation examples for this one? Maybe steering at this layer just breaks the model?

Latent Adversarial Training (LAT) Improves the Representation of Refusal

Clément Dumas4mo20

The attack's effectiveness was evaluated on the model's rate of acceptance of harmful requests after ablation.

How do you check if the model accepts your request?

Greedy-Advantage-Aware RLHF

Clément Dumas4mo*21

I'm a bit concerned the experiment is specifically designed for your algorithm rather than being a general reward hacking test. Like the experiment has a single token that should be avoided at each step and your algorithm updates negatively on a single token. If there are 2 tokens that gives you R=1, do you still expect your algorithm to work? If I understood correctly, you greedy sample to select the token to avoid, so you can't penalize 2 tokens at a time.

Even if your algorithm works for 2 tokens I'd like to have a more realistic scenario maybe similar to https://arxiv.org/abs/2210.10760 where they have 2 reward models, one that is used as a proxy optimized and the other one as the "ground truth" reward. If it generalizes to those scenario I'd be much more enthusiastic about your approach!

Extracting SAE task features for in-context learning

Clément Dumas8mo40

Nice work!

I'm curious about the cleanliness of a task vector after removing the mean of some corrupted prompts (i.e., same format but with random pairs). Do you plan to run this stronger baseline, or is there a notebook/codebase I could easily tweak to explore this?

Self-explaining SAE features

Clément Dumas9mo10

Yes, this is what I meant, reposting here insights @Arthur Conmy gave me on twitter

In general I expect the encoder directions to basically behave like the decoder direction with noise. This is because the encoder has to figure out how much features fire while keeping track of interfering features due to superposition. And this adjustment will make it messier