Robert_AIZI — LessWrong

SAEs you can See: Applying Sparse Autoencoders to Clustering

Robert_AIZI

TL;DR

We train sparse autoencoders (SAEs) on artificial datasets of 2D points, which are arranged to fall into pre-defined, visually-recognizable clusters. We find that the resulting SAE features are interpretable as a clustering algorithm via the natural rule "a point is in cluster N if feature N activates on it".
We primarily work with top-k SAEs (k=1) (as in Gao et al.), with a few modifications:
- Instead of reconstructing the original $(x, y)$ points, we embed each point into a 100-dimensional space, based on its distance to 100 fixed "anchor" points. The embedding of a point $p$ for an anchor point $a$ is roughly $\exp(-d(p, a)^{2})$. This embedded point is both the input and target of the SAE. This embedding allows our method to

... (read 2705 more words →)
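As a rough illustration of the embedding and the "cluster N if feature N activates" rule described above, here is a minimal sketch in plain Python. The anchor layout (a 10x10 grid), the helper names `embed` and `top1_cluster`, and the hand-built encoder rows are all assumptions made for this example; the post's actual SAE is trained, and its anchor placement is not specified in this excerpt.

```python
import math

def embed(point, anchors):
    """Map a 2D point to one coordinate per anchor: exp(-d(p, a)^2)."""
    px, py = point
    return [math.exp(-((px - ax) ** 2 + (py - ay) ** 2)) for ax, ay in anchors]

def top1_cluster(embedding, encoder_rows):
    """Top-k activation with k=1: index of the feature with the largest pre-activation."""
    scores = [sum(w * e for w, e in zip(row, embedding)) for row in encoder_rows]
    return max(range(len(scores)), key=scores.__getitem__)

# Assumption: 100 anchors on a regular grid over [-3, 3] x [-3, 3].
anchors = [(-3 + 6 * i / 9, -3 + 6 * j / 9) for i in range(10) for j in range(10)]

# Assumption: stand-in for trained encoder weights. Each row is simply the
# embedding of a hypothetical cluster centre, so feature N "points at" cluster N.
centres = [(-2.0, -2.0), (2.0, 2.0)]
encoder_rows = [embed(c, anchors) for c in centres]

points = [(-1.9, -2.1), (2.2, 1.8)]
labels = [top1_cluster(embed(p, anchors), encoder_rows) for p in points]
print(labels)
```

Each point is assigned to the feature whose row best matches its 100-dimensional embedding, which is the clustering rule the TL;DR describes.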

Comments on Anthropic's Scaling Monosemanticity

Robert_AIZI

These are some of my notes from reading Anthropic's latest research report, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.

TL;DR

In roughly descending order of importance:

It's great that Anthropic trained an SAE on a production-scale language model, and that the approach works to find interpretable features. It's great that those features allow interventions like the recently-departed Golden Gate Claude. I especially like the code bug feature.
I worry that naming features after high-activating examples (e.g. "the Golden Gate Bridge feature") gives a false sense of security. Most of the time that feature activates, it is irrelevant to the Golden Gate Bridge. That feature is only well-described as "related to the Golden Gate Bridge" if

... (read 1965 more words →)


Explaining a Math Magic Trick

Robert_AIZI

Introduction

A recent popular tweet did a "math magic trick", and I want to explain why it works and use that as an excuse to talk about cool math (functional analysis). The tweet in question:

This is a cute magic trick, and like any good trick they nonchalantly gloss over the most important step. Did you spot it? Did you notice your confusion?

Here's the key question: Why did they switch from a differential equation to an integral equation? If you can use $(1 - x)^{-1} = 1 + x + x^{2} + \dots$ when $x = \int$, why not use it when $x = d/dx$?

Well, let's try it, writing $D$ for the derivative:

$\begin{matrix} f' & = & f \\ (1 - D) f & = & 0 \\ f & = & (1 + D + D^{2} + \dots) 0 \\ f & = & 0 + 0 + 0 + \dots \\ f & = & 0 \end{matrix}$

So now you may be disappointed, but relieved: yes, this version fails, but at least it fails-safe, giving you the trivial solution,... (read 1237 more words →)
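To make the contrast concrete, here is a small sketch that applies the geometric series $1 + x + x^{2} + \dots$ with $x = \int$ and with $x = D$ to polynomials. The tweet itself is not reproduced here, so this assumes the integral form is $(1 - \int) f = 1$, with $\int$ meaning integration from 0 to $x$ and $f(0) = 1$. The helper names are made up for this example.

```python
from fractions import Fraction
import math

def integrate(coeffs):
    """Integrate a polynomial from 0 to x; coeffs[i] multiplies x**i."""
    return [Fraction(0)] + [c / (i + 1) for i, c in enumerate(coeffs)]

def differentiate(coeffs):
    """Differentiate a polynomial given as a coefficient list."""
    return [c * i for i, c in enumerate(coeffs)][1:] or [Fraction(0)]

def add(p, q):
    """Add two coefficient lists, padding the shorter one with zeros."""
    n = max(len(p), len(q))
    p = p + [Fraction(0)] * (n - len(p))
    q = q + [Fraction(0)] * (n - len(q))
    return [a + b for a, b in zip(p, q)]

def geometric_series(op, start, terms):
    """Compute (1 + op + op^2 + ...) applied to `start`, truncated to `terms` terms."""
    total, term = [Fraction(0)], start
    for _ in range(terms):
        total = add(total, term)
        term = op(term)
    return total

def evaluate(coeffs, x):
    """Evaluate the polynomial at a float x."""
    return sum(float(c) * x ** i for i, c in enumerate(coeffs))

# Integral version: (1 + I + I^2 + ...) applied to the constant 1.
integral_version = geometric_series(integrate, [Fraction(1)], 15)
# Derivative version: (1 + D + D^2 + ...) applied to 0.
derivative_version = geometric_series(differentiate, [Fraction(0)], 15)

print(evaluate(integral_version, 1.0), math.e)
print(derivative_version)
```

Under these assumptions the integral series builds up the Taylor polynomial of $e^{x}$ term by term, while the derivative series never leaves zero, which matches the hand computation above.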


Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Robert_AIZI

[3/7 Edit: I have rephrased the bolded claims in the abstract per this comment from Joseph Bloom, hopefully improving the heat-to-light ratio.

Commenters have also suggested training on earlier layers and using untied weights, and in my experiments this increases the number of classifiers found, so the headline number should be 33/180 features, up from 9/180. See this comment for updated results.]

Abstract

A sparse autoencoder is a neural network architecture that has recently gained popularity as a technique to find interpretable features in language models (Cunningham et al., Anthropic’s Bricken et al.). We train a sparse autoencoder on OthelloGPT, a language model trained on transcripts of the board game Othello, which has been shown... (read 2969 more words →)

Rating my AI Predictions

Robert_AIZI

9 months ago I predicted trends I expected to see in AI over the course of 2023. Here's how I did (bold indicates they happened, italics indicates they didn't, neither-bold-nor-italics indicates unresolved):

ChatGPT (or successor product from OpenAI) will have image-generating capabilities incorporated by end of 2023: 70%
No papers or press releases from OpenAI/Deepmind/Microsoft about incorporating video parsing or generation into production-ready LLMs through end of 2023: 90%
All publicly released LLM models accepting audio input by the end of 2023 use audio-to-text-to-matrices (e.g. transcribe the audio before passing it into the LLM as text) (conditional on the method being identifiable): 90%
All publicly released LLM models accepting image input by the end of 2023

... (read 412 more words →)

Comparing Anthropic's Dictionary Learning to Ours

Robert_AIZI

Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team's recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I want to compare our techniques and highlight what we did similarly or differently. My hope in writing this is to help readers understand the similarities and differences, and perhaps to lay the groundwork for a future synthesis approach.

First, let me note that we arrived at similar techniques in similar ways: both Anthropic and my team follow the lead of Lee Sharkey, Dan Braun, and beren's [Interim research report] Taking features out of superposition with... (read 1083 more words →)


Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Logan Riggs

Logan Riggs, Hoagy, Aidan Ewart, Robert_AIZI

This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models

We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.

Paper Overview

Sparse Autoencoders & Superposition

To reverse engineer a neural network, we'd like to first break it down into smaller units (features) that can be analysed in isolation. Using individual neurons as these units can be useful but neurons are often polysemantic, activating for several unrelated types of feature so just looking... (read 1201 more words →)


Unsafe AI as Dynamical Systems

Robert_AIZI

[Thanks to Valerie Morris for help editing this post.]

Overview

Large Language Models (LLMs) and their safety properties are often studied from the perspective of a single pass: what is the single next token the LLM will produce? But this is almost never how they are deployed. In contrast, LLMs are almost always run autoregressively: they produce tokens sequentially, taking earlier outputs as part of their input, until a halting condition is met (such as a special token or a token limit). In this post I discuss how one might study LLMs as dynamical systems, which emphasizes how an LLM can become more or less safe as it is run autoregressively.
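The autoregressive loop described above can be sketched as iterating a single-step map on a growing state. This is only a toy: `toy_step` is a hypothetical stand-in for an LLM's next-token function, and `run_autoregressively` is a name invented for this example.

```python
def run_autoregressively(step, prompt, halt_token, max_tokens):
    """Iterate a next-token function, feeding earlier outputs back in as input.

    Returns the trajectory of states (token sequences), which is the object a
    dynamical-systems view would study, rather than any single step.
    """
    state = list(prompt)
    trajectory = [tuple(state)]
    for _ in range(max_tokens):          # halting condition 1: token limit
        token = step(state)
        state.append(token)
        trajectory.append(tuple(state))
        if token == halt_token:          # halting condition 2: special token
            break
    return trajectory

def toy_step(state):
    """Hypothetical next-token rule: emit <eos> once the sequence has 4 tokens."""
    return "<eos>" if len(state) >= 4 else state[-1] + "!"

trajectory = run_autoregressively(toy_step, ["hi"], "<eos>", max_tokens=10)
for state in trajectory:
    print(state)
```

A safety property of interest would then be stated over the whole trajectory (for example, whether any state along it is unsafe), not over one call to `step`.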

Why Dynamics are Critical:

... (read 876 more words →)

AI teams will probably be more superintelligent than individual AIs

Robert_AIZI

Summary

Teams of humans (countries, corporations, governments, etc.) are more powerful and intelligent than individual humans. Our prior should be the same for AIs. AI organizations may then face coordination problems like management overhead, the principal-agent problem, defection/mutiny, and general Moloch-y-ness. This doesn’t especially reduce the risk of AI takeover.

Teams of Humans are more Powerful and Intelligent than Individuals

Human society is at such a scale that many goals can only be achieved by a team of many people working in concert. Such goals include winning a war, building a successful company, making a Hollywood movie, and realizing scientific achievements (e.g. building nuclear bombs, going to the moon, eradicating smallpox).

This is not a coincidence, because... (read 417 more words →)

[Research Update] Sparse Autoencoder features are bimodal

Robert_AIZI

Overview

The sparse autoencoders project is a mechanistic interpretability effort to algorithmically find semantically meaningful “features” in a language model. A recent update hints that features learned by this approach separate into two types depending on their maximum cosine similarity (MCS) score against a larger feature dictionary:

High-MCS features that reoccur across hyperparameters (speculatively, the “real” features that would be helpful for mechanistic interpretability)
Low-MCS features that do not reoccur (speculatively, dead neurons or artifacts of random noise)

Figure 1: Figure 3 from the replication, showing that MCS scores are bimodal, with peaks near MCS=0.3 and MCS=1.

In this post, we:

Demonstrate that the MCS distribution of the low-MCS features matches the distribution of random vectors.
Present data show

... (read 1212 more words →)
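For readers unfamiliar with the MCS score used above, here is a minimal sketch of how it can be computed: for each feature in one dictionary, take the largest cosine similarity against any feature in a larger dictionary. The function names and the tiny example dictionaries are illustrative assumptions, not the project's actual code or data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def max_cosine_similarity(small_dict, large_dict):
    """MCS of each feature in small_dict against the features of large_dict."""
    return [max(cosine(f, g) for g in large_dict) for f in small_dict]

# Toy dictionaries (made up for illustration).
small_dict = [(1.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
large_dict = [(2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

scores = max_cosine_similarity(small_dict, large_dict)
print(scores)
```

In this toy case the first feature has a near-exact match in the larger dictionary (MCS close to 1), while the second does not, mirroring the high-MCS and low-MCS split described in the overview.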

1M/461441	Criticism of left-wing politics / Democrats
1M/77390	Criticism of right-wing politics / Republicans