2y

This blog post discusses a collaborative research paper on sparse autoencoders (SAEs), specifically focusing on SAE evaluations and a new training method we call p-annealing. As the first author, I primarily contributed to the evaluation portion of our work. The views expressed here are my own and do not necessarily reflect the perspectives of my co-authors. You can access our full paper here.

Key Results

In our research on evaluating Sparse Autoencoders (SAEs) using board games, we had several key findings:

We developed two new metrics for evaluating Sparse Autoencoders (SAEs) in the context of board games: board reconstruction and coverage.
These metrics can measure progress between SAE training approaches that is invisible on existing metrics.
These metrics

... (read 2686 more words →)

1

38

Interpreting Preference Models w/ Sparse Autoencoders

Logan Riggs

Logan Riggs, Jannik Brinkmann

2y

This is the real reward output for an OS preference model. The bottom "jailbreak" completion was manually created by looking at reward-relevant SAE features.

Preference Models (PMs) are trained to imitate human preferences and are used when training with RLHF (reinforcement learning from human feedback); however, we don't know what features the PM is using when outputting reward. For example, maybe curse words make the reward go down and wedding-related words make it go up. It would be good to verify that the features we wanted to instill in the PM (e.g. helpfulness, harmlessness, honesty) are actually rewarded and those we don't (e.g. deception, sycophancey) aren't.

Sparse Autoencoders (SAEs) have been used to decompose... (read 2519 more words →)

12

75

2

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

abhayesian

abhayesian, Jannik Brinkmann, Victor Levoso

2y

This post is a summary of our paper A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task (ACL 2024). While we wrote and released the paper a couple of months ago, we have done a bad job promoting it so far. As a result, we’re writing up a summary of our results here to reinvigorate interest in our work and hopefully find some collaborators for follow-up projects. If you’re interested in the results we describe in this post, please see the paper for more details.

TL;DR - We train transformer models to find the path from the root of a tree to a given leaf (given an edge list of the tree).... (read 2579 more words →)

1

53

Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features

Logan Riggs

Logan Riggs, Jannik Brinkmann

2y

TL;DR

We achieve better SAE performance by:

Removing the lowest activating features
Replacing the L1(feature_activations) penalty function with L1(sqrt(feature_activations))

with 'better' meaning: we can reconstruct the original LLM activations w/ lower MSE & with fewer features/datapoint.

As a sneak peak (the graph should make more sense as we build up to it, don't worry!):

The L1(sqrt()) graph (ie dotted one) is farther to the lower-left corner (this is good!) of low Cross Entropy loss with low features/datapoint (ie L0 Norm).

Now in more details:

Sparse Autoencoders (SAEs) reconstruct each datapoint in [layer 3's residual stream activations of Pythia-70M-deduped] using a certain amount of features (this is the L0-norm of the hidden activation in the SAE). Typically the higher activations are... (read 1049 more words →)

5

26

LESSWRONG
LW

LESSWRONG
LW

Jannik Brinkmann

Interpreting Preference Models w/ Sparse Autoencoders

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

Evaluating Sparse Autoencoders with Board Game Models

Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features

Jannik Brinkmann

Jannik Brinkmann

Evaluating Sparse Autoencoders with Board Game Models

Interpreting Preference Models w/ Sparse Autoencoders

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features

Jannik Brinkmann

Interpreting Preference Models w/ Sparse Autoencoders

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

Evaluating Sparse Autoencoders with Board Game Models

Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features

Jannik Brinkmann

Jannik Brinkmann

Evaluating Sparse Autoencoders with Board Game Models

Interpreting Preference Models w/ Sparse Autoencoders

Finding Backward Chaining Circuits in Transformers Trained on Tree Search

Improving SAE's by Sqrt()-ing L1 & Removing Lowest Activating Features

TL;DR