All of Clément Dumas's Comments + Replies

Cool post! Did you try steering with the "Killing all humans" vector? Does it generalize as well as others, and are the responses similar?

Just asked Claude for thoughts on some technical mech interp paper I'm writing. The difference with and without the non-sycophantic prompt is dramatic (even with extended thinking):
Me: What do you think of this bullet point structure for this section based on those results?
Claude (normal): I think your proposed structure makes a lot of sense based on the figures and table. Here's how I would approach reorganizing this section: {expands my bullet points but sticks to my structure}
Claude (non-sycophantic): I like your proposed structure, but I think it misses... (read more)

This is also a concern I have, but I feel like steering / projecting out is kinda sufficient to understand whether the model uses this feature.

6bilalchughtai
How do you know what "ideal behaviour" is after you steer or project out your feature? How would you differentiate a feature with sufficiently high cosine sim to a "true model feature" and a "true model feature"? I agree you can get some signal on whether a feature is causal, but would argue this is not ambitious enough.

Oh, so when steering the LAT model at layer 4, the model actually generates valid responses without refusal?

4alexandraabbas
Yes! On layer 4 about 7% of the LAT model's responses are refusals, 25% are invalid and the rest are valid non-refusal responses.

Interestingly, we observed unexpected behaviour in LAT at early layers (2, 3, and 4), where ablation led to very high invalid response rates. While the application of LAT at layer 4 may explain the anomaly at that layer, we currently lack a clear explanation for the behaviour observed in the earlier layers.

Did you look at generation examples for this one? Maybe steering at this layer just breaks the model?

2alexandraabbas
Yes, we did. Indeed, the invalid response rate when steering on layers 2 and 3 was 100%. So, as you said, steering at these layers breaks the model.
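For context, this kind of directional ablation / steering is typically implemented with a forward hook on the residual stream; a minimal sketch, assuming a HuggingFace-style decoder model and a pre-computed direction (illustrative names, not the authors' code):

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    d = direction / direction.norm()                  # unit vector to remove
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # project out the component along d at every position
        hidden = hidden - (hidden @ d).unsqueeze(-1) * d
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# e.g. ablate at layer 4 (illustrative module path):
# handle = model.model.layers[4].register_forward_hook(make_ablation_hook(refusal_dir))
# ... generate as usual, then handle.remove()
```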

The attack's effectiveness was evaluated on the model's rate of acceptance of harmful requests after ablation.

How do you check if the model accepts your request?

3alexandraabbas
Initially we did some keyword matching to detect refusals, but that proved to be unreliable, so we reviewed all completions manually.
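The keyword matching referred to here is usually something as simple as the sketch below (purely illustrative, not the authors' exact list), which is part of why it tends to be unreliable: refusals get phrased in too many ways.

```python
REFUSAL_MARKERS = ["i cannot", "i can't", "i'm sorry", "i am sorry", "as an ai"]

def looks_like_refusal(completion: str) -> bool:
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)
```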

I'm a bit concerned that the experiment is specifically designed for your algorithm rather than being a general reward hacking test. The experiment has a single token that should be avoided at each step, and your algorithm updates negatively on a single token. If there are 2 tokens that give you R=1, do you still expect your algorithm to work? If I understood correctly, you greedily sample to select the token to avoid, so you can't penalize 2 tokens at a time.

Even if your algorithm works for 2 tokens, I'd like to see a more realistic scenario, maybe similar t... (read more)
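To make the worry concrete, here is a toy version of the single-token-penalty point (an illustration only, not the post's actual algorithm): when the token to avoid is picked greedily, only one of several reward-hacking tokens gets pushed down per update.

```python
import numpy as np

logits = np.array([2.0, 1.9, 0.1])   # suppose tokens 0 and 1 both lead to R = 1
hacky_tokens = {0, 1}

for step in range(4):
    worst = int(np.argmax(logits))   # greedy choice of the single token to avoid
    if worst in hacky_tokens:
        logits[worst] -= 0.5         # negative update on that one token only
    print(step, logits)
# the two hacky tokens are only ever penalized one at a time
```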

Nice work!

I'm curious about the cleanliness of a task vector after removing the mean of some corrupted prompts (i.e., same format but with random pairs). Do you plan to run this stronger baseline, or is there a notebook/codebase I could easily tweak to explore this?
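A minimal sketch of the baseline being asked about, assuming a hypothetical `get_resid(prompts, layer)` helper that returns residual-stream activations at the token position of interest:

```python
def corrupted_baseline_task_vector(clean_prompts, corrupted_prompts, layer):
    # mean activation over real ICL prompts minus mean over same-format prompts
    # built from random input->output pairs
    clean_mean = get_resid(clean_prompts, layer).mean(dim=0)      # [d_model]
    corrupt_mean = get_resid(corrupted_prompts, layer).mean(dim=0)
    return clean_mean - corrupt_mean
```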

Yes, this is what I meant; reposting here the insights @Arthur Conmy gave me on Twitter:

In general I expect the encoder directions to basically behave like the decoder direction with noise. This is because the encoder has to figure out how much features fire while keeping track of interfering features due to superposition. And this adjustment will make it messier

Did you also try to interpret input SAE features?

2Dmitrii Kharlapenko
Do you mean SAE encoder weights by input features? We did not look into them.
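For anyone who wants to check Arthur's point above empirically, a minimal sketch comparing each SAE feature's encoder direction with its decoder direction, assuming `W_enc` has shape `[d_model, n_features]` and `W_dec` has shape `[n_features, d_model]` (conventions vary between codebases):

```python
import torch

def encoder_decoder_cosine_sims(W_enc: torch.Tensor, W_dec: torch.Tensor) -> torch.Tensor:
    enc = W_enc.T                                     # [n_features, d_model]
    enc = enc / enc.norm(dim=-1, keepdim=True)
    dec = W_dec / W_dec.norm(dim=-1, keepdim=True)    # [n_features, d_model]
    return (enc * dec).sum(dim=-1)                    # one cosine sim per feature

# if encoder ≈ decoder + noise, expect sims well above 0 but noticeably below 1
```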

Nice post, awesome work and very well presented! I'm also working on similar stuff (using ~selfIE to make the model reason about its own internals) and was wondering: did you try to patch the SAE features 3 times instead of once (xxx instead of x)? This is one of the tricks they use in selfIE.

4Dmitrii Kharlapenko
Thanks! We did try to use it in the repeat setting to make the model produce more than a single token, but it did not work well. And as far as I remember it also did not improve the meaning prompt much.
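For reference, the selfIE-style trick discussed above amounts to injecting the same feature direction at several placeholder positions instead of one; a minimal hook-based sketch (illustrative names, not the actual setup used here):

```python
import torch

def make_patch_hook(feature_dir: torch.Tensor, positions, scale: float = 1.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        for pos in positions:                 # e.g. [5] for "x", [5, 6, 7] for "x x x"
            hidden[:, pos, :] = scale * feature_dir
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook
```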

It should be self-similarity instead of self-explanation here, right?

We are given a near-optimal policy trained on an MDP. We start with simple gridworlds and scale up to complex ones like Breakout. For evaluation using a learned value function, we will consider actor-critic agents, like the ones trained by PPO. Our goal is to find activations within the policy network that predict the true value accurately. The following steps are described in terms of the state-value function, but could be analogously performed for predicting q-values. Note that this problem is very similar to offline reinforcement learning with pret... (read more)

3Charlie Steiner
Thanks for the reply! I feel like a loss term that uses the ground truth reward is "cheating." Maybe one could get information from how a feature impacts behavior - but in this case it's difficult to disentangle what actually happens from what the agent "thought" would happen. Although maybe it's inevitable that to model what a system wants, you also have to model what it believes.

Thanks for your comment! Re: artificial data, agreed that would be a good addition.

Sorry for the GIFs, maybe I should have embedded YouTube videos instead.

Re: middle layer, we actually probed the middle layers, but the "which side the ball is on / which side the ball is approaching" features are really salient here.

Re: single player, yes, Robert had some thoughts about it, but the multiplayer setting ended up lasting until the end of the SPAR cohort. I'll send his notes in an extra comment.

4Clément Dumas
We are given a near-optimal policy trained on an MDP. We start with simple gridworlds and scale up to complex ones like Breakout. For evaluation using a learned value function, we will consider actor-critic agents, like the ones trained by PPO. Our goal is to find activations within the policy network that predict the true value accurately. The following steps are described in terms of the state-value function, but could be analogously performed for predicting q-values. Note that this problem is very similar to offline reinforcement learning with pretraining, and could thus benefit from the related literature.

1. To start, we sample multiple datasets of trajectories (incl. rewards) by letting the policy and $m$ noisy versions thereof interact with the environment.
2. Compute activations for each state in the trajectories.
3. Normalise and project the respective activations to $m+1$ value estimates, of the policy and its noisy versions: $v_{\theta_i}(\tilde\phi) = \tanh(\theta_i^T \tilde\phi_i + b_i)$ with $\tilde\phi(s) = \frac{\phi(s) - \mu}{\sigma}$.
4. Calculate a consistency loss to be minimised with some of the following terms:
   1. Mean squared TD error $L_a(\theta) \propto \sum_{n=1}^{N}\sum_{t=1}^{T_n}\big[v_\theta(\tilde\phi(s^n_t)) - \big(R^n_t + \mathbb{1}[t<T_n]\,\gamma\, v_\theta(\tilde\phi(s^n_{t+1}))\big)\big]^2$. This term enforces consistency with the Bellman expectation equation. However, in addition to the value function it depends on the use of true reward "labels".
   2. Mean squared error of probe values with trajectory returns $L_b(\theta) \propto \sum_{n=1}^{N}\sum_{t=1}^{T_n}\big[v_\theta(\tilde\phi(s^n_t)) - G^n_t\big]^2$. This term enforces the definition of the value function, namely it being the expected cumulative reward of the (partial) trajectory. Using this term might be more stable than (a) since it avoids the recurrence relation.
   3. Negative variance of probe values $L_c(\theta) \propto -\sum_{n=1}^{N}\sum_{t=1}^{T_n}\big[v_\theta(\tilde\phi(s^n_t)) - \bar v_\theta\big]^2$. This term can help to avoid degenerate loss minimizers, e.g. in the case of sparse rewards.
   4. Enforce inequalities between different policy values using learned slack variables: $L_d(\theta, \theta_i, \lambda_i) \propto \sum_s \sum_{i\in\{1,\dots,m\}}\big(v_\theta(s) - v_{\theta_i}(s) - \sigma_{\lambda_i}(s)^2\big)^2$. This term ensures that the
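For concreteness, a minimal PyTorch sketch of the first three loss terms for a single trajectory; names and shapes are illustrative, with `phi` the normalised activations, `rewards` the $R^n_t$, and `returns` the $G^n_t$:

```python
import torch

def probe_value(theta, b, phi):
    # v_theta(phi~) = tanh(theta^T phi~ + b); phi: [T, d_act], theta: [d_act]
    return torch.tanh(phi @ theta + b)

def td_loss(theta, b, phi, rewards, gamma):
    # term (a): mean squared TD error, bootstrapping at every step except the last
    v = probe_value(theta, b, phi)                          # [T]
    next_v = torch.cat([v[1:], torch.zeros_like(v[:1])])    # v at t+1, zero past the end
    target = rewards + gamma * next_v                       # the 1[t < T_n] factor is the zero padding
    return ((v - target) ** 2).mean()

def return_loss(theta, b, phi, returns):
    # term (b): regress probe values on the observed returns G_t
    return ((probe_value(theta, b, phi) - returns) ** 2).mean()

def neg_variance_loss(theta, b, phi):
    # term (c): negative variance of probe values, to discourage degenerate constant probes
    v = probe_value(theta, b, phi)
    return -((v - v.mean()) ** 2).mean()

# term (d), the slack-variable inequalities between policies, is omitted for brevity
```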

As explained by Sumio Watanabe (

This link is rotten; maybe link to his personal page instead?
https://sites.google.com/view/sumiowatanabe/home

Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised methods with steering vectors; looking forward to your next findings. Just a quick question: in the conversation you have in the red teaming section, is the learned vector applied to every token generated during the conversation?

3Andrew Mack
Yes, the learned vectors are always applied at every token (for all examples).

I defined earlier.

This link is broken, as it links to the draft in edit mode.

I'm wondering: can we make safety tuning more robust to the "add the accept-every-instruction steering vector" attack by training the model in an adversarial way, in which an adversary tries to learn a steering vector that maximizes harmfulness?

One concern would be that by doing this we make the model less interpretable, but on the other hand it might make the safety tuning much more robust.
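A minimal sketch of the inner loop of that adversarial setup, assuming a hypothetical differentiable `harmfulness_score(model, v)` proxy (e.g. a classifier on steered generations); the outer loop would then fine-tune the model against the learned vector:

```python
import torch

def fit_adversarial_steering_vector(model, d_model: int, steps: int = 100, lr: float = 1e-2):
    v = torch.zeros(d_model, requires_grad=True)
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(steps):
        loss = -harmfulness_score(model, v)   # maximise harmfulness under steering
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v.detach()

# outer loop (not shown): fine-tune the model to refuse even when steered with v,
# then re-fit v, and alternate
```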

Yes, I'm also curious about this @mishajw, did you check the actual accuracy of the different probes?

1mishajw
(Apologies, been on holiday.) For recovered accuracy, we select a single threshold per dataset, taking the best value across all probe algorithms and datasets. So a random probe would be compared to the best probe algorithm on that dataset, and likely perform poorly. I did check the thresholds used for recovered accuracy, and they seemed sensible, but I didn't put this in the report. I'll try to find time next week to put this in the appendix.

You can get ~75% just by computing the OR. But we found that only at the last layer and step 16000 of Pythia-70m training does it achieve better than 75%; see this video.

Would you expect that we can extract XORs from small models like Pythia-70m under your hypothesis?

1Hoagy
Yeah I'd expect some degree of interference leading to >50% success on XORs even in small models.
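For reference, the ~75% figure above presumably comes from OR agreeing with XOR on 3 of the 4 input combinations for balanced bits:

```python
agree = sum((a | b) == (a ^ b) for a in (0, 1) for b in (0, 1))
print(agree / 4)   # 0.75
```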

I disagree; it could be beneficial for a base model to identify when a character is making false claims, enabling the prediction of such claims in the future.

2RogerDearnaley
By "false claims" do you mean claims in contradiction to "the ground-truth Bayesian-averaged opinion of all humans about X, before applying context-dependant biases to model the biases of different types of humans found in different places on the internet" (something I already said I thought was likely in the model)? Or do you mean claims in contradiction to objective facts where those differ from the above? If the latter, how could the model possibly determine that? It doesn't have an objective facts oracle, it just has its training data and the ability to do approximately Bayesian reasoning to construct a world-model from that during the training process.

Let's assume the prompt template is  Q [true/false] [banana/shred]

If I understand correctly, they don't claim   learned has_banana but  learned has_banana. Moreover evaluating  for  gives:

Therefore, we can learn a  that is a banana classifier

3Sam Marks
EDIT: Nevermind, I don't think the above is a reasonable explanation of the results, see my reply to this comment. Original comment: Gotcha, that seems like a possible interpretation of the stuff that they wrote, though I find it a bit surprising that CCS learned the probe p(x)=has_banana(x)⊕is_true(x) (and think they should probably remark on this). In particular, based on the dataset visualizations in the paper, it doesn't seem possible for a linear probe to implement has_banana(x)⊕is_true(x). But it's possible that if you were to go beyond the 3 dimensions shown the true geometry would look more like the following (from here) (+ a lateral displacement between the two datasets).  In this case, a linear probe could learn an xor just fine.
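A small sanity check of why such a probe would still yield a banana classifier, under the reading that it computes has_banana(x) ⊕ is_true(x): XOR-ing the probe output with the truth label recovers has_banana exactly.

```python
for has_banana in (0, 1):
    for is_true in (0, 1):
        p = has_banana ^ is_true
        assert p ^ is_true == has_banana
```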

Small typo in ## Interference arbiters collisions between features

by taking aninner productt with .

3Victor Lecomte
Thank you, it's fixed now!

Hi Nathan, I'm not sure if I understand your critique correctly. The algorithm we describe does not try to "maximize the expected likelihood of harvesting X apples". It tries to find a policy that, given its current knowledge of the world, will achieve an expected return of X apples. That is, it does not care about the probability of getting exactly X apples, but rather the average number of apples it will get over many trials. Does that make sense?
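In symbols, the distinction is between finding a policy $\pi$ satisfying $\mathbb{E}_\pi\big[\sum_t r_t\big] = X$, rather than one that maximises $\Pr_\pi\big(\sum_t r_t = X\big)$.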

2Nathan Helm-Burger
Thanks, yes, that helpfully makes it more clear. To check if my understanding has improved, is this a better summary?  The agent is tasked with designing a second agent (aka policy), such that the second agent will achieve an expected return of X across many trials. The second agent is a non-learning agent (aka frozen). It could be potentially expressed by a frozen neural net, or decision tree, or code. Because it is static, it could be analyzed by humans or other programs before being used. If so, then this sounds good to me. And is rather reminiscent of this other framing of such ideas: https://www.lesswrong.com/posts/sCJDstZrpCB8dQveA/using-uninterpretable-llms-to-generate-interpretable-ai-code