Clément Dumas

I'm a CS master's student at ENS Paris-Saclay. I want to pursue a career in AI safety research.

https://butanium.github.io/

Posts

7 Finding the estimate of the value of a state in RL agents

6mo

38 Aspiration-based Q-Learning

Comments

Extracting SAE task features for in-context learning

Clément Dumas3mo30

Nice work!

I'm curious about the cleanliness of a task vector after removing the mean of some corrupted prompts (i.e., same format but with random pairs). Do you plan to run this stronger baseline, or is there a notebook/codebase I could easily tweak to explore this?
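To make the baseline I have in mind concrete, here is a minimal sketch, assuming activations have already been collected at some layer for task prompts and for corrupted prompts. The function name and array shapes are hypothetical and not taken from the post's codebase.

```python
import numpy as np

def mean_centered_task_vector(task_acts, corrupted_acts):
    """Hypothetical helper: task vector with the corrupted-prompt mean removed.

    task_acts:      [n_task, d] activations on real task prompts
    corrupted_acts: [n_corrupt, d] activations on prompts with the same
                    format but random pairs
    """
    # Whatever is shared between the two sets (e.g. prompt format)
    # cancels in the difference of means.
    return task_acts.mean(axis=0) - corrupted_acts.mean(axis=0)

# Toy illustration: a shared "format" direction plus a task direction.
format_dir = np.array([1.0, 0.0])
task_dir = np.array([0.0, 2.0])
task_acts = np.stack([format_dir + task_dir] * 4)
corrupted_acts = np.stack([format_dir] * 4)
cleaned = mean_centered_task_vector(task_acts, corrupted_acts)
```

In this toy case the format component cancels and only the task direction remains; whether real task vectors behave this cleanly is exactly the open question.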

Self-explaining SAE features

Clément Dumas4mo10

Yes, this is what I meant. Reposting here the insights @Arthur Conmy gave me on Twitter:

In general I expect the encoder directions to basically behave like the decoder directions with noise. This is because the encoder has to figure out how much features fire while keeping track of interfering features due to superposition. And this adjustment will make it messier.
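One way to check this expectation would be to compare each feature's encoder direction with its decoder direction. Below is a minimal sketch, assuming the SAE weights are stored as `W_enc` of shape `[d_model, n_features]` and `W_dec` of shape `[n_features, d_model]`; that layout and the function name are assumptions, not any particular library's convention.

```python
import numpy as np

def encoder_decoder_cosine(W_enc, W_dec):
    """Cosine similarity between encoder and decoder direction, per feature.

    W_enc: [d_model, n_features] (assumed layout)
    W_dec: [n_features, d_model] (assumed layout)
    """
    enc = W_enc.T  # one row per feature
    enc = enc / np.linalg.norm(enc, axis=1, keepdims=True)
    dec = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    return np.sum(enc * dec, axis=1)

# Toy weights: feature 0 has identical encoder and decoder directions,
# feature 1 has an encoder direction tilted away from its decoder direction.
W_dec = np.eye(2)
W_enc = np.array([[1.0, 1.0],
                  [0.0, 1.0]])
cosines = encoder_decoder_cosine(W_enc, W_dec)
```

Values close to 1 would mean the encoder direction is essentially the decoder direction; lower values would reflect the "noise" added to handle interference.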

Self-explaining SAE features

Clément Dumas4mo10

Did you also try to interpret input SAE features?

Self-explaining SAE features

Clément Dumas4mo50

Nice post, awesome work and very well presented! I'm also working on similar stuff (using ~selfIE to make the model reason about its own internals) and was wondering: did you try patching the SAE features three times instead of once (xxx instead of x)? This is one of the tricks they use in selfIE.

Self-explaining SAE features

Clément Dumas4mo30

It should be self-similarity instead of self-explanation here, right?

Finding the estimate of the value of a state in RL agents

Clément Dumas5mo40

We are given a near-optimal policy trained on an MDP. We start with simple gridworlds and scale up to complex ones like Breakout. For evaluation using a learned value function, we will consider actor-critic agents, like the ones trained by PPO. Our goal is to find activations within the policy network that predict the true value accurately. The following steps are described in terms of the state-value function, but could be performed analogously for predicting Q-values. Note that this problem is very similar to offline reinforcement learning with pretraining, and could thus benefit from the related literature.

To start, we sample multiple datasets of trajectories (incl. rewards) by letting the policy and noisy versions thereof interact with the environment.
Compute activations for each state in the trajectories.
Normalise and project respective activations to $m + 1$ value estimates, of the policy and its noisy versions: $v_{\theta_{i}}(\tilde{\phi}) = \tanh(\theta_{i}^{\top} \tilde{\phi}_{i} + b_{i})$ with $\tilde{\phi}(s) = \frac{\phi(s) - \mu}{\sigma}$
Calculate a consistency loss to be minimised with some of the following terms:
1. Mean squared TD error $L_{a}(\theta) \propto \sum_{n = 1}^{N} \sum_{t = 1}^{T_{n}} \left[ v_{\theta}(\tilde{\phi}(s_{t}^{n})) - \left( R_{t}^{n} + \mathbb{1}[t < T_{n}] \, \gamma \, v_{\theta}(\tilde{\phi}(s_{t + 1}^{n})) \right) \right]_{\theta}^{2}$
  This term enforces consistency with the Bellman expectation equation. However, in addition to the value function it depends on the use of true reward “labels”.
2. Mean squared error of probe values with trajectory returns $L_{b}(\theta) \propto \sum_{n = 1}^{N} \sum_{t = 1}^{T_{n}} \left[ v_{\theta}(\tilde{\phi}(s_{t}^{n})) - G_{t}^{n} \right]_{\theta}^{2}$
  This term enforces the definition of the value function, namely it being the expected cumulative reward of the (partial) trajectory. Using this term might be more stable than the TD term $L_{a}$ since it avoids the recurrence relation.
3. Negative variance of probe values $L_{c}(\theta) \propto - \sum_{n = 1}^{N} \sum_{t = 1}^{T_{n}} \left[ v_{\theta}(\tilde{\phi}(s_{t}^{n})) - \bar{v}_{\theta} \right]_{\theta}^{2}$
  This term can help to avoid degenerate loss minimizers, e.g. in the case of sparse rewards.
4. Enforce inequalities between different policy values using learned slack variables $L_{d}(\theta, \theta_{i}, \lambda_{i}) \propto \sum_{s} \sum_{i \in \{1, \dots, m\}} \left( v_{\theta}(s) - v_{\theta_{i}}(s) - \sigma_{\lambda_{i}}(s)^{2} \right)^{2}$
  This term ensures that the policy consistently dominates its noisy versions and is completely unsupervised.
Train the linear probes using the training trajectories.
Evaluate on held-out test trajectories by comparing the value function to the actual returns. If the action space is simple enough, use the value function to plan in the environment and compare the resulting behaviour to that of the policy.
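
To make the first three loss terms concrete, here is a minimal NumPy sketch for a single probe. It assumes the activations are already normalised, takes the return to be $G_{t}^{n} = \sum_{k = t}^{T_{n}} \gamma^{k - t} R_{k}^{n}$, and averages over time steps instead of summing, which is compatible with the $\propto$ above. The function names and the trajectory format are hypothetical, and the slack-variable term $L_{d}$ is left out. Since tanh bounds the estimates to $(-1, 1)$, the sketch only makes sense if returns are scaled into that range.

```python
import numpy as np

def probe_values(theta, b, phi_tilde):
    """Linear probe with tanh squashing: v = tanh(phi_tilde @ theta + b)."""
    return np.tanh(phi_tilde @ theta + b)

def discounted_returns(rewards, gamma):
    """G_t = R_t + gamma * G_{t+1}, computed backwards over one trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def consistency_losses(theta, b, trajectories, gamma):
    """trajectories: list of (phi_tilde [T, d], rewards [T]) pairs (assumed format)."""
    td_sum, mc_sum, n_steps, all_values = 0.0, 0.0, 0, []
    for phi_tilde, rewards in trajectories:
        v = probe_values(theta, b, phi_tilde)
        # Indicator 1[t < T_n]: no bootstrapping from beyond the last step.
        v_next = np.append(v[1:], 0.0)
        td_sum += np.sum((v - (rewards + gamma * v_next)) ** 2)
        mc_sum += np.sum((v - discounted_returns(rewards, gamma)) ** 2)
        all_values.append(v)
        n_steps += len(rewards)
    all_values = np.concatenate(all_values)
    neg_var = -np.sum((all_values - all_values.mean()) ** 2)
    return {"td": td_sum / n_steps, "mc": mc_sum / n_steps, "neg_var": neg_var / n_steps}

# Toy usage: one 3-step trajectory with a sparse reward at the end.
rewards = np.array([0.0, 0.0, 1.0])
phi_tilde = np.ones((3, 2))
losses = consistency_losses(np.zeros(2), 0.0, [(phi_tilde, rewards)], gamma=0.5)
```

With an all-zero probe the variance term is zero, which is the degenerate solution the negative-variance term is meant to discourage when rewards are sparse.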

Finding the estimate of the value of a state in RL agents

Clément Dumas5mo10

Thanks for your comment! Re: artificial data, agreed that would be a good addition.

Sorry for the GIFs; maybe I should have embedded YouTube videos instead.

Re: middle layer, we actually probed the middle layers, but the "which side the ball is / which side the ball is approaching" features are really salient here.

Re: single player, yes, Robert had some thoughts about it, but the multiplayer setting ended up lasting until the end of the SPAR cohort. I'll send his notes in an extra comment.

Towards Developmental Interpretability

Clément Dumas6mo30

As explained by Sumio Watanabe (

This link is dead; maybe link to his personal page instead?
https://sites.google.com/view/sumiowatanabe/home

Mechanistically Eliciting Latent Behaviors in Language Models

Clément Dumas7moΩ220

Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised methods with steering vectors, and I'm looking forward to your next findings. Just a quick question: in the conversation you have in the red teaming section, is the learned vector applied to every token generated during the conversation?

Mechanistically Eliciting Latent Behaviors in Language Models

Clément Dumas7mo10

I defined earlier.

This link is broken, as it links to the draft in edit mode.