Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability
ABSTRACT: We introduce a straightforward yet effective method to break down transformer outputs into individual components. By treating the model’s non-linear activations as constants, we can decompose the output in a linear fashion, expressing it as a sum of contributions. These contributions can be easily calculated using linear projections. We...
Jun 17, 2024
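To make the idea in the abstract concrete, here is a minimal sketch on a toy residual block with a single ReLU MLP. This is not the authors' implementation: the weights, the dimensions, and names such as `W_in`, `W_out`, `W_U`, and `gate` are hypothetical, and attention and layer normalization are left out. It only shows that once the non-linearity is held constant for a given input, the logits are an exact sum of linear projections.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_vocab = 8, 16, 5

# Hypothetical toy weights, chosen at random for illustration only.
W_in = rng.normal(size=(d_model, d_hidden))   # MLP input projection
W_out = rng.normal(size=(d_hidden, d_model))  # MLP output projection
W_U = rng.normal(size=(d_model, n_vocab))     # unembedding matrix

x = rng.normal(size=d_model)                  # residual stream input

# Ordinary forward pass: the residual stream is the input plus the MLP output.
pre = x @ W_in
hidden = np.maximum(pre, 0.0)                 # ReLU
logits = (x + hidden @ W_out) @ W_U

# Treat the non-linearity as a constant: record which neurons fired
# for this input and keep that gate fixed.
gate = (pre > 0).astype(x.dtype)

# With the gate fixed, each path to the logits is a linear projection.
direct_contrib = x @ W_U                                  # residual path
neuron_contribs = (pre * gate)[:, None] * (W_out @ W_U)   # one row per neuron
mlp_contrib = neuron_contribs.sum(axis=0)

# The contributions sum back to the original logits.
reconstructed = direct_contrib + mlp_contrib
assert np.allclose(reconstructed, logits)
```

Each row of `neuron_contribs` is the share of the logits attributable to one hidden neuron, so the same bookkeeping can be read per component or summed per layer.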