Cross-Layer Feature Alignment and Steering in Large Language Models
This article is a brief summary of our research in mechanistic interpretability. It first discusses the motivation behind our work, then gives an overview of our previous work, and finally outlines the future directions we consider important.

Introduction and Motivation

Large language models (LLMs) often represent concepts...