Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger. RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations. arXiv preprint, 2024. URL https://arxiv.org/abs/2402.17700
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2406.04093
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao. A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2407.02646
Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger. RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations. arXiv preprint, 2024. URL https://arxiv.org/abs/2402.17700
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2406.04093
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Joseph Bloom. Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small. LessWrong, 2024. URL https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream
Andrew Quaisley. Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT. LessWrong, 2024. URL https://www.lesswrong.com/posts/ignCBxbqWWPYCdCCx/research-report-alternative-sparsity-methods-for-sparse
Robert_AIZI. Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT. LessWrong, 2024. URL https://www.lesswrong.com/posts/arEub4eHo2EJfCHQX/research-report-sparse-autoencoders-find-only-9-180-board-state-features-in-othellogpt
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2406.04093
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2406.04093
Robert_AIZI. Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT. LessWrong, 2024. URL https://www.lesswrong.com/posts/arEub4eHo2EJfCHQX/research-report-sparse-autoencoders-find-only-9-180-board-state-features-in-othellogpt
Robert_AIZI. Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT. LessWrong, 2024. URL https://www.lesswrong.com/posts/arEub4eHo2EJfCHQX/research-report-sparse-autoencoders-find-only-9-180-board-state-features-in-othellogpt
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2406.04093
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2406.04093
Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. OpenAI Blog, 2023. URL https://openaipublic.blob.core.windows.net/ neuron-explainer/paper/index.html
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2406.04093
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2405.12241.
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2405.12241.
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2405.12241.
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2405.12241.
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2405.12241.
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2406.04093
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2405.12241.
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2405.12241.
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Summary
Sparse Autoencoders (SAEs) have become a popular tool in Mechanistic Interpretability but at the time of writing there is no standardm automated way to accurately assess the quality of a SAE. In practice many different metrics are applied and interpreted to assess SAE quality in different studies.
Conceptually however what makes a good SAE is relatively uncontroversial: its latent features must reflect the structure of the model it was trained on whilst also reflecting the structure of human linguistic understanding[1][2][3]. These twin objectives are usually referred to as "Faithfulness" and "Interpretability"[1][4].
In this article we will go through 8 common metrics used to evaluate the quality of a SAE and try to demonstrate that, though this is sometimes not stated explicitly, each metric can be thought of as attempting to measure either Faithfulness or Interpretability.
This article is not an attempt to prove anything, but rather suggest a mental model which might be useful.
Faithfullness vs Interpretability
At this stage there is no widespread practical use for SAEs. When one arises a metric pertaining to this use will probably become the gold standard for what makes a "good" SAE.
Until then the qualities sought in an SAE are stated well by Huang et. al.[2]
In other words a good SAE will be one which identifies "abstract concepts" in the linguistic sense and maps these to structures inside the host LM[1].
Importantly these two qualities are independent of one another. In fact to anyone who has trained a SAE the two will probably seem frustratingly opposed.
Interpretability can be achieved more easily by sacrificing the close relation held between the Autoencoder and the host model. For example by adding layers to the encoder and decoder pipelines.
Similarly Faithfullness can be achieved without interpretability. The most extreme example would be the neurons of the LM themselves, which are notoriously difficult to interpret[1]. A less extreme example would be an Autoencoder trained without any, or a very low sparsity loss term.
The goal of much of the current SAE literature is training one which achieves both Faithfullness and Interpretability at the same time.
Explainability vs Interpretability
Importantly in the sense we are using it here "Interpretability" has a slightly broader meaning than its literal definition. We are expanding it to refer to a general correspondence with human understanding of language and thought.
The term Explainability as it is used in the literature has a narrower definition. It refers to any phenomena of a SAE which can be understood by a human[3][1]. Under this definition the ideal SAE must be more than merely perfectly Explainable. Features which fire with perfect accuracy on single words are highly Explainable but if these kinds of features make up the entire SAE it probably isn't going to be useful for anything.
Instead we want a SAE whose features are also interesting to humans and generally allow them to communicate with (e.g. steer) and understand LMs in ways that they find useful. This more stringent requirement is what we mean when we use the word Interpretability in this article, though this is by no means a recommendation that others follow suit.
Popular Metrics
What follows is an enumeration of 8 metrics used in the SAE literature and their advantages and disadvantages as measures of Faithfulness and Interpretability. To give a quick overview we rank them below (very roughly) by how difficult they are to use in practice.
Formalization
Let M be a Language Model.
Let si be the i'th sample in an evaluation dataset E containing n total samples.
Let Ai be the activations which flow out of a chosen transformer block lj of the residual stream of Mwhen the sample si is fed into M.
Let S be a Sparse Autoencoder trained to take in Ai, encode it into a latent representation Li of rank d and then decode it back as close to Ai again as possible using the same methods as Bricken et. al.[1]. Let ^Ai be the actual output of S after decoding. Let Se be just the encoder portion of S such that Se(Ai)=Li. Let Sd be the decoder portion such that Sd(Li)=^Ai.
Measures of Interpretability
L1
Description
The L1 norm for a single vector is the sum absolute values in a vector. In the context of SAEs the vector in question is the latent inside the Autoencoder. Geometrically L1 can be thought of as the "distance" between the vector and the origin.
L1(Li)=∥Li∥1=∑dk=1|Lk|
Usually in the literature L1 refers to the average L1 over E.
Usage
L1 is a key component of most sparse auto-encoder training pipelines. It is almost always the only component of the training loss which encourages sparsity in L.
Advantages
Disadvantages
L0
Description
L0 is the count of non-zero elements in the latent vector L. Mathematically:
L0(Li)=∥Li∥0=∑dk=11{Lk≠0}
Usage
In the literature L0, usually averaged across E, is used as a direct measure of sparsity of the Autoencoder S[5] and hence a very rough measure of interpretability or potential interpretability.
Advantages
Disadvantages
Probe Loss[7][3]
Description
To calculate probe loss you first curate a binary classification dataset to reflect some concept such that it contains positive samples which exemplify that concept and also negative samples which do not. For instance the positive examples might contain positive sentiment and the negative ones neutral or negative sentiment.
Then when you want to assess your SAE you train a simple classifier, such as a logistic regression model to take the Li derived from a classification dataset sample as input and perform the classification. The Probe loss is simply the classification loss for that task.
More formally let the classification dataset be CD containing w samples.
Let each sample from that dataset be (xi,yi)∈CD where xi is the ith input sample and yi is its ground truth label.
Let C(Li)=^y be a simple classifier which takes in the Autoencoder latent for the sample xi an returns a classification prediction. Let Me be a function which reads a sample xi into M and returns the activations Ai of block lj i.e. Li=Se(Me(xi)).
Let lf(yi,^yi) be some loss function such as Binary Cross EntropyL:yilog(^yi)+(1−yi)log(1−^yi)
Let PL=1w∑wi=1lf(C(Se(Me(xi)),yi) be the Probe Loss for that particular dataset.
Purpose
Probe Loss is named because of its relation to "Linear Probes". It has been used as a measure of Interpretability in Gao et. al.[3], where multiple classification datasets were created, most of them encoding high level concepts. However the same technique was used by Robert Huben as a measure of Faithfulness to determine whether a SAE could recover simple board state features from an OthelloGPT model[7]. In a very straightforward way Probe Loss gives us an idea of whether L can be used to perform a particular kind of classification.
Advantages
Disadvantages
A SAE may have good Faithfulness and Interpretability while failing to facilitate any particular classification capability[7]. Therefore the loss on a single probe is likely not a good measure of overall SAE Interpretability or Faithfulness. Gao et. al.[3] found that even when averaging 61 Probes their Probe Loss was quite noisy, as can be seen in the figure below.
Activation Prediction Loss[3][8]
Description
For each feature you wish to analyze:
Now repeat this for each feature to get an overall Activation Prediction Loss.
Purpose
The idea behind Activation Prediction Loss is that features which cannot be predicted are likely to be random or nonsensical. Often the prediction function p(E,ri) is a LM and is thought of as a proxy for human understanding, with the idea being that features which can be predicted accurately by humans are likely to be Explainable.
Advantages
Disadvantages
Human Analysis[1]
Description
Create a human-readable "Interpretability" rubric, then for each feature calculate an Interpretability score by
For instance in Bricken et. al.[1] the authors created a rubric which had the assessor read all the examples, form an interpretation for what the underlying feature was and then essentially rank how explainable and internally consistent the feature was.
Purpose
In Bricken et. al.[1] the rubric was only aimed at assessing Explainability but rubrics could be designed to assess Interpretability more generally.
Advantages
Disadvantages
Measures of Faithfulness
MSE Reconstruction
Description
In the context of SAEs MSE Reconstruction is the Mean Squared Error between the model activations Ai for a particular sample si and the Autoencoder reconstruction ^Ai
MSER=1n∑ni=1(Ai−^Ai)2
Purpose
MSER is primarily used when training SAEs and it usually forms a major component of the loss function. The MSER loss term is designed to train the latent to be as close to a perfect compressed (or in the case of most SAEs, overcomplete and and sparse) representation of A as possible.
Advantages
Disadvantages
We can imagine cases where an Autoencoder achieves a very low MSER, but fails to accurately represent M in L. For instance where ^A is always very close to A but the small differences turn out to be the most important component of A from the perspective of M.[9] Because of this MSER is at best a very rough a proxy of Faithfulness.
Downstream Reconstruction[9]
Description
Downstream Reconstruction is simply MSER taken between the activations at every block after lj (though in theory you could include lj as well), after we substitute the usual Ai for the reconstructed ^Ai in the forward pass of M. It is calculated as follows:
Let M have g transformer blocks.
Let Aih denote the activation of the hth block in M when M is given the sample si as input.
To add some visuals, the blue arrows in the diagram below collectively represent DR for a single sample as it was implemented in Braun et. al.[9]
Purpose
The Purpose of DR is to provide a version of MSER which is sensitive to downstream changes within the network, not just the immediate geometric difference between Ai and ^Ai
Advantages
Disadvantages
Downstream Loss[9][3]
Description
Downstream Loss is very similar to Downstream Reconstruction. The main difference is that it the loss is formed on the basis of the final token probabilities coming out of M rather than intermediate activations. It is calculated as follows:
Repeat steps 1-7 as many times as you like on different samples and take the average KL divergence to arrive at Downstream Loss.
Purpose
The most common purpose of Downstream Loss is to identify "functionally important features" [9] i.e. those elements in L which have an impact on what tokens end up coming out of M.
Advantages
Disadvantages
What about Universality?
Universality is another quality we would like our latent L to achieve. L is considered universal when its features remain Interpretable across many different corpuses and to many different humans (for instance humans who speak different languages) and when its features correspond to the internal structures of many different LMs, not just M[1].
In this way we may think of there being two kinds of universality, Universal Interpretability and Universal Faithfulness. For instance, we can easily imagine training a SAE on a very limited corpus of Urdu technical jargon. Such a SAE would learn features which would not be Universally Interpretable (they would not seem to make sense to many humans) but could well be Universally Faithful (they may appear across many multilingual LMs).
Universality is certainly a quality we might want to achieve when training a SAE but we mention it here mainly to show that thinking about it through the lens of the two main objectives of Faithfulness and Interpretability is a reasonable thing to do.
Conclusion
It may be useful to conceptualize a good Sparse Autoencoder as one which maintains a close correspondence with the internal structure of its host model (Faithfulness) and with the structure of human language understanding (Interpretability).
While not always explicitly stated, many metrics used in practice to assess the quality of Sparse Autoencoders can be thought of as measuring one of these two qualities. Keeping this underlying dichotomy in mind may prove useful when assessing and explaining the outcomes of Sparse Autoencoder evaluations, especially to newcomers.
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemanticfeatures/index.html
Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, Atticus Geiger. RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations. arXiv preprint, 2024. URL https://arxiv.org/abs/2402.17700
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2406.04093
Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao. A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2407.02646
Joseph Bloom. Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small. LessWrong, 2024. URL https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream
Andrew Quaisley. Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT. LessWrong, 2024. URL https://www.lesswrong.com/posts/ignCBxbqWWPYCdCCx/research-report-alternative-sparsity-methods-for-sparse
Robert_AIZI. Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT. LessWrong, 2024. URL https://www.lesswrong.com/posts/arEub4eHo2EJfCHQX/research-report-sparse-autoencoders-find-only-9-180-board-state-features-in-othellogpt
Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. OpenAI Blog, 2023. URL https://openaipublic.blob.core.windows.net/ neuron-explainer/paper/index.html
Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey. Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning. arXiv preprint, 2024. URL https://doi.org/10.48550/arXiv.2405.12241.