Exploring the Residual Stream of Transformers for Mechanistic Interpretability — Explained

— by Zeping Yu, Dec 24, 2023

In this post, I present the paper “Exploring the Residual Stream of Transformers”. We aim to locate the parameters that store the knowledge used for next-word prediction, and to identify the mechanism by which the model merges that knowledge into the final embedding.

For the sentence “The capital of France is” -> “Paris”, we find that the knowledge is stored in both attention layers and FFN layers. Some attention subvalues provide the knowledge “Paris is related to France”. Some FFN subvalues provide the knowledge “Paris is a capital”; these are mainly activated by attention subvalues related to “capitals/cities”.
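To make the idea of "FFN subvalues" concrete, here is a minimal NumPy sketch. It assumes the common key-value reading of an FFN layer, in which each neuron contributes one subvalue: an activation coefficient times one row of the output matrix. All weights are random toy values, and the names `W1`, `W2`, `W_U`, and `target` are hypothetical stand-ins, not taken from the paper. The sketch also ignores the final layer norm and uses ReLU, so it only illustrates the additive structure, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab = 8, 16, 5          # toy sizes, for illustration only

W1 = rng.normal(size=(d_model, d_ff))    # hypothetical FFN input matrix
W2 = rng.normal(size=(d_ff, d_model))    # hypothetical FFN output matrix (one value vector per row)
W_U = rng.normal(size=(d_model, vocab))  # hypothetical unembedding matrix

x = rng.normal(size=d_model)             # stand-in for the residual stream at one position

# One activation coefficient per neuron (ReLU is an assumption here).
coeffs = np.maximum(x @ W1, 0.0)

# Each subvalue is a coefficient times the matching row of W2.
subvalues = coeffs[:, None] * W2         # shape (d_ff, d_model)

# The FFN output is exactly the sum of its subvalues.
ffn_out = subvalues.sum(axis=0)
assert np.allclose(ffn_out, coeffs @ W2)

# Unembedding is linear, so each subvalue has its own additive
# contribution to the vocabulary logits.
per_subvalue_logits = subvalues @ W_U    # shape (d_ff, vocab)
total_logits = ffn_out @ W_U
assert np.allclose(per_subvalue_logits.sum(axis=0), total_logits)

# Rank neurons by how much they push a chosen token (hypothetical id).
target = 3
top_neuron = int(np.argmax(per_subvalue_logits[:, target]))
print("neuron contributing most to the target token:", top_neuron)
```

Under these assumptions, locating the parameters that carry a piece of knowledge amounts to ranking subvalues by their contribution to the target token, which is possible only because the contributions add up.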

Overall, our contributions are as follows. First, we explore the distribution change of residual connections in vocabulary...
