Tom Angsten

Limitations on the Interpretability of Learned Features from Sparse Dictionary Learning

Overview

When I began my attempt to replicate Anthropic's Towards Monosemanticity paper, I had high expectations for how interpretable the extracted features should be. If I judged a feature to be uninterpretable, I attributed this to sub-optimal sparse autoencoder hyperparameters, a bug in my implementation, or possible shortcomings of dictionary learning. Interpretability of the features was my ultimate measure of success or failure. However, I now believe that even without these factors, that is, even if a sparse autoencoder learned the exact set of features represented by the model, a subset of those features could be much less interpretable than I had initially expected.

This post presents one... (read 2534 more words →)

Replying to Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Tom Angsten 2y*

Superposition of features is only advantageous at a certain point in a network when it is followed by non-linear filtering, as explained in Toy Models of Superposition. Yet, this work places the sparse autoencoder at a point in the one-layer LLM which, up to the logits, is not followed by any non-linear operations. Given this, I would expect that there is no superposition among the activations fed to the sparse autoencoder, and that 512 (the size of the MLP output vector) is the maximum number of features the model can usefully represent.

If the above is true, then expansion factors to the sparse representation greater than 1 would not improve the quality or... (read more)
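The linear-algebra intuition behind this comment can be illustrated with a small sketch. It assumes, as the comment does, that everything between the autoencoder's input and the logits is linear. The dimensions, the random `feature_dirs`, and the `unembed` matrix are all hypothetical stand-ins, not values from the paper. The sketch shows only that the combined feature-to-logit map cannot have rank above the activation width, however many feature directions are packed in; it does not settle whether an overcomplete dictionary can still recover sparse features.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8      # stand-in for the 512-dimensional MLP output (assumption)
n_features = 32  # hypothetical number of feature directions, larger than d_model
n_vocab = 50     # hypothetical vocabulary size

# Hypothetical feature directions packed into the d_model-dimensional space.
feature_dirs = rng.normal(size=(n_features, d_model))

# Hypothetical purely linear path from the activations to the logits.
unembed = rng.normal(size=(d_model, n_vocab))

# Effect of each feature on the logits when no non-linearity intervenes.
feature_to_logits = feature_dirs @ unembed  # shape (n_features, n_vocab)

# The product factors through d_model dimensions, so its rank is capped there.
rank = np.linalg.matrix_rank(feature_to_logits)
print(f"features: {n_features}, rank of feature-to-logit map: {rank}")
```

With 32 feature directions and 8 activation dimensions, the logit effects of the features span at most an 8-dimensional subspace, which is the constraint the comment appeals to.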

Replying to Ground-Truth Label Imbalance Impairs the Performance of Contrast-Consistent Search (and Other Contrast-Pair-Based Unsupervised Methods)

Tom Angsten 2y

I don't think dataset imbalance is the cause of the poor performance for auto-regressive models when unsupervised methods are applied. I believe both papers enforced a 50/50 balance when applying CCS.

So why might a supervised probe succeed when CCS fails? My best guess is that, for the datasets considered in these papers, auto-regressive models do not have sufficiently salient representations of truth after constructing contrast pairs. Contrast pair construction does not guarantee isolating truth as the most salient feature difference between the positive and negative representations. For example, imagine for IMDB movie reviews, the model most saliently represents consistency between the last completion token ('positive'/'negative') and positive or negative words in the... (read more)

Ground-Truth Label Imbalance Impairs the Performance of Contrast-Consistent Search (and Other Contrast-Pair-Based Unsupervised Methods)

Tom Angsten, Ami Hays

Summary

Contrast-Consistent Search (CCS) is a method for finding truthful directions within the activation spaces of large language models (LLMs) in an unsupervised way, introduced in Burns et al., 2022. However, all experiments in that study involve training datasets that are balanced with respect to the ground-truth labels of the questions used to generate contrast pairs.^[1] This allows for the possibility that CCS performance is implicitly dependent on the balance of ground-truth labels, and therefore is not truly unsupervised.

In this post, we demonstrate that the imbalance of ground-truth labels in the training dataset can prevent CCS, or any contrast-pair-based unsupervised method, from consistently finding truthful directions in an LLM's activation space.
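For readers unfamiliar with CCS, here is a minimal sketch of the objective as described in Burns et al., 2022: a consistency term that pushes the probe's outputs on the two halves of a contrast pair to sum to one, plus a confidence term that penalizes the degenerate answer of 0.5 for both. The function name `ccs_loss` and the toy inputs are illustrative assumptions, not code from the post. Note that the loss never takes the ground-truth labels as input, so any dependence on label balance has to enter through the training data.

```python
import numpy as np


def ccs_loss(p_pos: np.ndarray, p_neg: np.ndarray) -> float:
    """CCS objective on probe outputs for contrast pairs (illustrative sketch).

    p_pos[i] is the probe's probability that the "yes" completion of pair i
    is true; p_neg[i] is the same for the "no" completion.
    """
    # Consistency: p(x+) should equal 1 - p(x-).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the trivial solution p(x+) = p(x-) = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))


# A perfectly consistent, confident probe has zero loss.
confident = ccs_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# The degenerate probe is consistent but pays the confidence penalty.
degenerate = ccs_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))

print(confident, degenerate)
```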

This post is... (read 1942 more words →)

LESSWRONG