All of Clément Dumas's Comments + Replies

Cool post! Did you try steering with the "Killing all humans" vector? Does it generalize as well as others, and are the responses similar?

Just asked Claude for thoughts on some technical mech interp paper I'm writing. The difference with and without the non-sycophantic prompt is dramatic (even with extended thinking):
Me: What do you think of this bullet point structure for this section based on those results?
Claude (normal): I think your proposed structure makes a lot of sense based on the figures and table. Here's how I would approach reorganizing this section: {expands my bullet points but sticks to my structure}
Claude (non-sycophantic): I like your proposed structure, but I think it misses... (read more)

This is also a concern I have, but I feel like steering / projecting out is kinda sufficient to understand whether the model uses this feature.

6bilalchughtai
How do you know what "ideal behaviour" is after you steer or project out your feature? How would you differentiate a feature with sufficiently high cosine sim to a "true model feature" and a "true model feature"? I agree you can get some signal on whether a feature is causal, but would argue this is not ambitious enough.

Oh, so when steering the LAT model at layer 4, the model actually generates valid responses without refusal?

4alexandraabbas
Yes! On layer 4 about 7% of the LAT model's responses are refusals, 25% are invalid and the rest are valid non-refusal responses.

Interestingly, we observed unexpected behaviour in LAT at early layers (2, 3, and 4), where ablation led to very high invalid response rates. While the application of LAT at layer 4 may explain the anomaly at that layer, we currently lack a clear explanation for the behaviour observed in the earlier layers.

Did you look at generation examples for this one? Maybe steering at this layer just breaks the model?

2alexandraabbas
Yes, we did. Indeed, the invalid response rate when steering on layers 2 and 3 was 100%. So, as you said, steering at these layers breaks the model.
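For context, this kind of directional ablation / steering is typically implemented with a forward hook on the residual stream; a minimal sketch, assuming a HuggingFace-style decoder model and a pre-computed direction (illustrative names, not the authors' code):

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    d = direction / direction.norm()                  # unit vector to remove
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # project out the component along d at every position
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# e.g. ablate at layer 4 (illustrative module path):
# handle = model.model.layers[4].register_forward_hook(make_ablation_hook(refusal_dir))
# ... generate as usual, then handle.remove()
```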

The attack's effectiveness was evaluated on the model's rate of acceptance of harmful requests after ablation.

How do you check if the model accepts your request?

3alexandraabbas
Initially we did some keyword matching to detect refusals, but that proved to be unreliable, so we reviewed all completions manually.
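The keyword matching referred to here is usually something as simple as the sketch below (purely illustrative, not the authors' exact list), which is part of why it tends to be unreliable: refusals get phrased in too many ways.

```python
REFUSAL_MARKERS = ["i cannot", "i can't", "i'm sorry", "i am sorry", "as an ai"]

def looks_like_refusal(completion: str) -> bool:
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```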

I'm a bit concerned that the experiment is specifically designed for your algorithm rather than being a general reward hacking test. The experiment has a single token that should be avoided at each step, and your algorithm updates negatively on a single token. If there are 2 tokens that give you R=1, do you still expect your algorithm to work? If I understood correctly, you greedily sample to select the token to avoid, so you can't penalize 2 tokens at a time.

Even if your algorithm works for 2 tokens, I'd like to see a more realistic scenario, maybe similar t... (read more)
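To make the worry concrete, here is a toy version of the single-token-penalty point (an illustration only, not the post's actual algorithm): when the token to avoid is picked greedily, only one of several reward-hacking tokens gets pushed down per update.

```python
import numpy as np

logits = np.array([2.0, 1.9, 0.1])   # suppose tokens 0 and 1 both lead to R = 1
hacky_tokens = {0, 1}

for step in range(4):
    worst = int(np.argmax(logits))   # greedy choice of the single token to avoid
    if worst in hacky_tokens:
        logits[worst] -= 0.5         # negative update on that one token only
    print(step, logits)
# the two hacky tokens are only ever penalized one at a time
```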

Nice work!

I'm curious about the cleanliness of a task vector after removing the mean of some corrupted prompts (i.e., same format but with random pairs). Do you plan to run this stronger baseline, or is there a notebook/codebase I could easily tweak to explore this?
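A minimal sketch of the baseline being asked about, assuming a hypothetical `get_resid(prompts, layer)` helper that returns residual-stream activations at the token position of interest:

```python
def corrupted_baseline_task_vector(clean_prompts, corrupted_prompts, layer):
    # mean activation over real ICL prompts minus mean over same-format prompts
    # built from random input->output pairs
    clean_mean = get_resid(clean_prompts, layer).mean(dim=0)      # [d_model]
    corrupt_mean = get_resid(corrupted_prompts, layer).mean(dim=0)
    return clean_mean - corrupt_mean
```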

Yes, this is what I meant; reposting here the insights @Arthur Conmy gave me on Twitter:

In general I expect the encoder directions to basically behave like the decoder direction with noise. This is because the encoder has to figure out how much features fire while keeping track of interfering features due to superposition. And this adjustment will make it messier

Did you also try to interpret input SAE features?

2Dmitrii Kharlapenko
Do you mean SAE encoder weights by input features? We did not look into them.
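For anyone who wants to check Arthur's point above empirically, a minimal sketch comparing each SAE feature's encoder direction with its decoder direction, assuming `W_enc` has shape `[d_model, n_features]` and `W_dec` has shape `[n_features, d_model]` (conventions vary between codebases):

```python
import torch

def encoder_decoder_cosine_sims(W_enc: torch.Tensor, W_dec: torch.Tensor) -> torch.Tensor:
    enc = W_enc.T                                     # [n_features, d_model]
    enc = enc / enc.norm(dim=-1, keepdim=True)
    dec = W_dec / W_dec.norm(dim=-1, keepdim=True)    # [n_features, d_model]
    return (enc * dec).sum(dim=-1)                    # one cosine sim per feature

# if encoder ≈ decoder + noise, expect sims well above 0 but noticeably below 1
```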

Nice post, awesome work and very well presented! I'm also working on similar stuff (using ~selfIE to make the model reason about its own internals) and was wondering: did you try to patch the SAE features 3 times instead of once (xxx instead of x)? This is one of the tricks they use in selfIE.

4Dmitrii Kharlapenko
Thanks! We did try to use it in the repeat setting to make the model produce more than a single token, but it did not work well. And as far as I remember it also did not improve the meaning prompt much.
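For reference, the selfIE-style trick discussed above amounts to injecting the same feature direction at several placeholder positions instead of one; a minimal hook-based sketch (illustrative names, not the actual setup used here):

```python
import torch

def make_patch_hook(feature_dir: torch.Tensor, positions, scale: float = 1.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        for pos in positions:                 # e.g. [5] for "x", [5, 6, 7] for "x x x"
            hidden[:, pos, :] = scale * feature_dir
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook
```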

It should be self-similarity instead of self-explanation here, right?

We are given a near-optimal policy trained on an MDP. We start with simple gridworlds and scale up to complex ones like Breakout. For evaluation using a learned value function, we will consider actor-critic agents, like the ones trained by PPO. Our goal is to find activations within the policy network that predict the true value accurately. The following steps are described in terms of the state-value function, but could be analogously performed for predicting q-values. Note that this problem is very similar to offline reinforcement learning with pret... (read more)

3Charlie Steiner
Thanks for the reply! I feel like a loss term that uses the ground truth reward is "cheating." Maybe one could get information from how a feature impacts behavior - but in this case it's difficult to disentangle what actually happens from what the agent "thought" would happen. Although maybe it's inevitable that to model what a system wants, you also have to model what it believes.

Thanks for your comment! Re: artificial data, agreed that would be a good addition.

Sorry for the GIFs, maybe I should have embedded YouTube videos instead.

Re: middle layer, we actually probed the middle layers, but the "which side the ball is on / which side the ball is approaching" features are really salient here.

Re: single player, yes, Robert had some thoughts about it, but the multiplayer setting ended up lasting until the end of the SPAR cohort. I'll send his notes in an extra comment.

4Clément Dumas
We are given a near-optimal policy trained on an MDP. We start with simple gridworlds and scale up to complex ones like Breakout. For evaluation using a learned value function, we will consider actor-critic agents, like the ones trained by PPO. Our goal is to find activations within the policy network that predict the true value accurately. The following steps are described in terms of the state-value function, but could be analogously performed for predicting q-values. Note that this problem is very similar to offline reinforcement learning with pretraining, and could thus benefit from the related literature.

1. To start, we sample multiple datasets of trajectories (incl. rewards) by letting the policy and $m$ noisy versions thereof interact with the environment.
2. Compute activations for each state in the trajectories.
3. Normalise and project the respective activations to $m+1$ value estimates, of the policy and its noisy versions: $v_{\theta_i}(\tilde\phi) = \tanh(\theta_i^T \tilde\phi_i + b_i)$ with $\tilde\phi(s) = \frac{\phi(s) - \mu}{\sigma}$.
4. Calculate a consistency loss to be minimised with some of the following terms:
   1. Mean squared TD error $L_a(\theta) \propto \sum_{n=1}^{N}\sum_{t=1}^{T_n}\big[v_\theta(\tilde\phi(s^n_t)) - \big(R^n_t + \mathbb{1}[t<T_n]\,\gamma\, v_\theta(\tilde\phi(s^n_{t+1}))\big)\big]^2$. This term enforces consistency with the Bellman expectation equation. However, in addition to the value function it depends on the use of true reward "labels".
   2. Mean squared error of probe values with trajectory returns $L_b(\theta) \propto \sum_{n=1}^{N}\sum_{t=1}^{T_n}\big[v_\theta(\tilde\phi(s^n_t)) - G^n_t\big]^2$. This term enforces the definition of the value function, namely it being the expected cumulative reward of the (partial) trajectory. Using this term might be more stable than (a) since it avoids the recurrence relation.
   3. Negative variance of probe values $L_c(\theta) \propto -\sum_{n=1}^{N}\sum_{t=1}^{T_n}\big[v_\theta(\tilde\phi(s^n_t)) - \bar v_\theta\big]^2$. This term can help to avoid degenerate loss minimizers, e.g. in the case of sparse rewards.
   4. Enforce inequalities between different policy values using learned slack variables: $L_d(\theta, \theta_i, \lambda_i) \propto \sum_s \sum_{i\in\{1,\dots,m\}}\big(v_\theta(s) - v_{\theta_i}(s) - \sigma_{\lambda_i}(s)^2\big)^2$. This term ensures that the
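For concreteness, a minimal PyTorch sketch of the first three loss terms for a single trajectory; names and shapes are illustrative, with `phi` the normalised activations, `rewards` the $R^n_t$, and `returns` the $G^n_t$:

```python
import torch

def probe_value(theta, b, phi):
    # v_theta(phi~) = tanh(theta^T phi~ + b); phi: [T, d_act], theta: [d_act]
    return torch.tanh(phi @ theta + b)

def td_loss(theta, b, phi, rewards, gamma):
    # term (a): mean squared TD error, bootstrapping at every step except the last
    v = probe_value(theta, b, phi)                          # [T]
    next_v = torch.cat([v[1:], torch.zeros_like(v[:1])])    # v at t+1, zero past the end
    target = rewards + gamma * next_v                       # the 1[t < T_n] factor is the zero padding
    return ((v - target) ** 2).mean()

def return_loss(theta, b, phi, returns):
    # term (b): regress probe values on the observed returns G_t
    return ((probe_value(theta, b, phi) - returns) ** 2).mean()

def neg_variance_loss(theta, b, phi):
    # term (c): negative variance of probe values, to discourage degenerate constant probes
    v = probe_value(theta, b, phi)
    return -((v - v.mean()) ** 2).mean()

# term (d), the slack-variable inequalities between policies, is omitted for brevity
```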

As explained by Sumio Watanabe (

This link is rotten; maybe link to his personal page instead?
https://sites.google.com/view/sumiowatanabe/home

Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised methods with steering vectors; looking forward to your next findings. Just a quick question: in the conversation you have in the red teaming section, is the learned vector applied to every token generated during the conversation?

3Andrew Mack
Yes, the learned vectors are always applied at every token (for all examples).

I defined earlier.

This link is broken, as it links to the draft in edit mode.

I'm wondering: can we make safety tuning more robust to the "add the accept-every-instruction steering vector" attack by training the model in an adversarial way, in which an adversary tries to learn a steering vector that maximizes harmfulness?

One concern would be that by doing this we make the model less interpretable, but on the other hand it might make the safety tuning much more robust.
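A minimal sketch of the inner loop of that adversarial setup, assuming a hypothetical differentiable `harmfulness_score(model, v)` proxy (e.g. a classifier on steered generations); the outer loop would then fine-tune the model against the learned vector:

```python
import torch

def fit_adversarial_steering_vector(model, d_model: int, steps: int = 100, lr: float = 1e-2):
    v = torch.zeros(d_model, requires_grad=True)
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(steps):
        loss = -harmfulness_score(model, v)   # maximise harmfulness under steering
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v.detach()

# outer loop (not shown): fine-tune the model to refuse even when steered with v,
# then re-fit v, and alternate
```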

Yes, I'm also curious about this @mishajw, did you check the actual accuracy of the different probes?

1mishajw
(Apologies, been on holiday.) For recovered accuracy, we select a single threshold per dataset, taking the best value across all probe algorithms and datasets. So a random probe would be compared to the best probe algorithm on that dataset, and likely perform poorly. I did check the thresholds used for recovered accuracy, and they seemed sensible, but I didn't put this in the report. I'll try to find time next week to put this in the appendix.

You can get ~75% just by computing the OR. But we found that only at the last layer and step 16000 of Pythia-70m training does it achieve better than 75%; see this video.

Would you expect that we can extract XORs from small models like Pythia-70m under your hypothesis?

1Hoagy
Yeah I'd expect some degree of interference leading to >50% success on XORs even in small models.
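For reference, the ~75% figure above presumably comes from OR agreeing with XOR on 3 of the 4 input combinations for balanced bits:

```python
agree = sum((a | b) == (a ^ b) for a in (0, 1) for b in (0, 1))
print(agree / 4)   # 0.75
```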

I disagree; it could be beneficial for a base model to identify when a character is making false claims, enabling the prediction of such claims in the future.

2RogerDearnaley
By "false claims" do you mean claims in contradiction to "the ground-truth Bayesian-averaged opinion of all humans about X, before applying context-dependant biases to model the biases of different types of humans found in different places on the internet" (something I already said I thought was likely in the model)? Or do you mean claims in contradiction to objective facts where those differ from the above? If the latter, how could the model possibly determine that? It doesn't have an objective facts oracle, it just has its training data and the ability to do approximately Bayesian reasoning to construct a world-model from that during the training process.

Let's assume the prompt template is  Q [true/false] [banana/shred]

If I understand correctly, they don't claim   learned has_banana but  learned has_banana. Moreover evaluating  for  gives:

Therefore, we can learn a  that is a banana classifier

3Sam Marks
EDIT: Nevermind, I don't think the above is a reasonable explanation of the results, see my reply to this comment. Original comment: Gotcha, that seems like a possible interpretation of the stuff that they wrote, though I find it a bit surprising that CCS learned the probe p(x)=has_banana(x)⊕is_true(x) (and think they should probably remark on this). In particular, based on the dataset visualizations in the paper, it doesn't seem possible for a linear probe to implement has_banana(x)⊕is_true(x). But it's possible that if you were to go beyond the 3 dimensions shown the true geometry would look more like the following (from here) (+ a lateral displacement between the two datasets).  In this case, a linear probe could learn an xor just fine.
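A small sanity check of why such a probe would still yield a banana classifier, under the reading that it computes has_banana(x) ⊕ is_true(x): XOR-ing the probe output with the truth label recovers has_banana exactly.

```python
for has_banana in (0, 1):
    for is_true in (0, 1):
        p = has_banana ^ is_true
        assert p ^ is_true == has_banana
```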

Small typo in ## Interference arbiters collisions between features

by taking aninner productt with .

3Victor Lecomte
Thank you, it's fixed now!

Hi Nathan, I'm not sure if I understand your critique correctly. The algorithm we describe does not try to "maximize the expected likelihood of harvesting X apples". It tries to find a policy that, given its current knowledge of the world, will achieve an expected return of X apples. That is, it does not care about the probability of getting exactly X apples, but rather the average number of apples it will get over many trials. Does that make sense?
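In symbols, the distinction is between finding a policy $\pi$ satisfying $\mathbb{E}_\pi\big[\sum_t r_t\big] = X$, rather than one that maximises $\Pr_\pi\big(\sum_t r_t = X\big)$.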

2Nathan Helm-Burger
Thanks, yes, that helpfully makes it more clear. To check if my understanding has improved, is this a better summary?  The agent is tasked with designing a second agent (aka policy), such that the second agent will achieve an expected return of X across many trials. The second agent is a non-learning agent (aka frozen). It could be potentially expressed by a frozen neural net, or decision tree, or code. Because it is static, it could be analyzed by humans or other programs before being used. If so, then this sounds good to me. And is rather reminiscent of this other framing of such ideas: https://www.lesswrong.com/posts/sCJDstZrpCB8dQveA/using-uninterpretable-llms-to-generate-interpretable-ai-code