This is a linkpost for https://arxiv.org/abs/2410.08869

This research was completed for the Supervised Program for Alignment Research (SPAR) summer 2024 iteration. The team was supervised by @Stefan Heimersheim (Apollo Research). Find out more about the program and upcoming iterations here.

TL;DR: We look for related SAE features using only statistical correlations between their activations. We consider this a cheap way to estimate, for example, how many features are new in a layer and how many are passed through from previous layers (similar to the feature lifecycle in Anthropic’s Crosscoders). We find communities of related features, as well as features that appear to be quasi-boolean combinations of previous-layer features.

Here’s a web interface showcasing our feature graphs.

Communities of sparse features through a forward pass. Nodes represent residual stream SAE features that were active in the residual stream for a specific prompt of text. The rows of the graph correspond to layers in GPT-2 (the bottom row is an earlier layer). The edges represent the Jaccard similarity of the activations of features across many other prompts. Colors represent different subgraphs discovered by a community-finding algorithm. Features within a community typically capture similar concepts. More graphs can be viewed in the feature browser.

We ran many prompts through GPT-2 residual stream SAEs and measured which features fired together. We then built connected graphs of frequently co-firing features (“communities”) that span multiple layers (inspired by work from Marks 2024). As expected, features within a community fire on similar input tokens. In some cases, features appeared to “specialize” in later layers.
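
To make the co-firing idea concrete, here is a minimal sketch on toy data. It is not the code from the paper: the feature names, the activation sets, and the 0.5 threshold are made up for illustration, all feature pairs are compared regardless of layer, and plain connected components stand in for whichever community-finding algorithm was actually used.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b| of two sets of input indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def communities(active_on: dict, threshold: float) -> list:
    """Connected components of the graph whose edges are feature pairs
    with Jaccard similarity >= threshold.

    active_on maps a (hypothetical) feature name to the set of input
    indices on which that feature was active.
    """
    parent = {f: f for f in active_on}

    def find(f):
        # Union-find lookup with path halving.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for f, g in combinations(active_on, 2):
        if jaccard(active_on[f], active_on[g]) >= threshold:
            parent[find(f)] = find(g)

    groups = {}
    for f in active_on:
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(g) for g in groups.values())


# Toy data (assumed): which inputs each feature fired on.
active_on = {
    "L2/f17": {0, 1, 2, 3},
    "L3/f42": {0, 1, 2},
    "L4/f08": {1, 2, 3},
    "L3/f99": {7, 8},
}
result = communities(active_on, threshold=0.5)
print(result)  # [['L2/f17', 'L3/f42', 'L4/f08'], ['L3/f99']]
```

In this toy example the three features that fire on overlapping inputs end up in one community spanning layers 2 to 4, while the unrelated feature forms a community of its own.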

Feature evolution: a layer 2 SAE feature (bottom) that detects “evidence” in many different contexts specializes into later-layer features that detect “evidence” in mutually exclusive contexts. We discovered these relationships by counting, over all inputs on which a given feature was active, which previous-layer feature was most frequently active on the same inputs.
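
The counting step described in this caption can be sketched roughly as follows. This is an illustration under assumptions rather than the paper's implementation: the feature names and activation sets are invented toy data, and `top_predecessor` is a hypothetical helper name.

```python
def top_predecessor(later_active: set, earlier_features: dict):
    """Return (name, share) for the previous-layer feature that is active
    most often on the inputs where the later-layer feature is active.

    share is the fraction of the later feature's active inputs on which
    that previous-layer feature is also active.
    """
    if not later_active or not earlier_features:
        return None, 0.0
    best = max(
        earlier_features,
        key=lambda name: len(earlier_features[name] & later_active),
    )
    share = len(earlier_features[best] & later_active) / len(later_active)
    return best, share


# Toy data (assumed): input indices on which each feature was active.
layer2 = {
    "L2/evidence": {0, 1, 2, 3, 4, 5},
    "L2/other": {4, 5, 9},
}
layer3 = {
    "L3/evidence_a": {0, 1, 2},  # fires in one kind of context
    "L3/evidence_b": {3, 4, 5},  # fires in a disjoint kind of context
}

for name, active in layer3.items():
    print(name, top_predecessor(active, layer2))
```

Here both layer 3 features trace back to the same general layer 2 feature, even though they never fire on the same input, which is the specialization pattern the figure shows.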

You can read the paper on arXiv, or find us at the ATTRIB workshop at NeurIPS!
