Abstract
Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood. In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train $k$-sparse linear classifiers (probes) on these internal activations to predict the presence of features in the input; by varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale. With $k=1$, we localize individual neurons which are highly relevant for a particular feature, and perform a number of case studies to illustrate general properties of LLMs. In particular, we show that early layers make use of sparse combinations of neurons to represent many features in superposition, that middle layers have seemingly dedicated neurons to represent higher-level contextual features, and that increasing scale causes representational sparsity to increase on average, but there are multiple types of scaling dynamics. In all, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 70 million to 6.9 billion parameters.
See twitter summary here.
Contributions
In the first part of the paper, we outline several variants of sparse probing, discuss the various subtleties of applying it, and run a large number of probing experiments. In particular, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 2 orders of magnitude in parameter count (up to 6.9 billion). The majority of the paper then zooms in on specific examples of general phenomena in a series of more detailed case studies, demonstrating that:
- There is a tremendous amount of interpretable structure within the neurons of LLMs, and sparse probing is an effective methodology for locating such neurons (even in superposition), though it requires careful use and follow-up analysis to draw rigorous conclusions.
- Many early layer neurons are in superposition, where features are represented as sparse linear combinations of polysemantic neurons, each of which activates for a large collection of unrelated $n$-grams and local patterns. Moreover, based on weight statistics and insights from toy models, we conclude that the first 25\% of fully connected layers employ substantially more superposition than the rest.
- Higher-level contextual and linguistic features (e.g., \texttt{is\_python\_code}) are seemingly encoded by monosemantic neurons, predominantly in middle layers, though conclusive statements about monosemanticity remain methodologically out of reach.
- As models increase in size, representation sparsity increases on average, but different features obey different dynamics: some features with dedicated neurons emerge with scale, others split into finer-grained features with scale, and many remain unchanged or fluctuate seemingly at random.
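To make the sparse probing setup concrete, here is a minimal sketch on synthetic data. The activation matrix, labels, neuron-ranking heuristic (class-conditional mean difference), and least-squares probe below are all illustrative assumptions for exposition; the paper's actual selection and training procedures may differ.

```python
# Minimal sketch of k-sparse probing (illustrative only; the exact
# neuron-ranking and probe-training choices here are assumptions).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a layer's neuron activations: n examples x d
# neurons, with a binary feature label y (e.g., is_python_code).
n, d, k = 2000, 512, 8
X = rng.normal(size=(n, d))
y = (X[:, 3] + 0.5 * X[:, 17] > 0).astype(int)  # feature lives on 2 neurons

# Step 1: rank neurons by a cheap univariate statistic -- here, the
# absolute difference of class-conditional mean activations.
mu_diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
topk = np.argsort(mu_diff)[-k:]  # indices of the k most relevant neurons

# Step 2: fit a linear probe restricted to those k neurons (least
# squares on +/-1 targets with a bias term), yielding a k-sparse
# linear classifier over the full neuron basis.
A = np.column_stack([X[:, topk], np.ones(n)])
w, *_ = np.linalg.lstsq(A, 2 * y - 1, rcond=None)
acc = ((A @ w > 0).astype(int) == y).mean()
```

With $k=1$ this degenerates to selecting the single highest-ranked neuron, which is the regime used in the paper to localize individual feature-relevant neurons; sweeping $k$ traces out how accuracy depends on representational sparsity.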
We will have a follow-up post in the coming weeks with what we see as the key alignment takeaways and open questions following this work.