ntt123

Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability

ABSTRACT: We introduce a straightforward yet effective method to break down transformer outputs into individual components. By treating the model’s non-linear activations as constants, we can decompose the output in a linear fashion, expressing it as a sum of contributions. These contributions can be easily calculated using linear projections. We...

Jun 17, 2024

Exploring Llama-3-8B MLP Neurons

TL;DR: We created a dataset of text snippets that strongly activate neurons in the Llama-3-8B model. The dataset shows that meaningful features can be found in these neurons. Explore the neurons with the web interface: https://neuralblog.github.io/llama3-neurons/neuron_viewer.html An example of a "derivative" neuron, which is triggered when the text mentions the concept of derivatives. Introduction...

Jun 9, 2024
