Cross-Layer Feature Alignment and Steering in Large Language Models
This article is a brief summary of our research in mechanistic interpretability. It first discusses the motivation behind our work, then gives an overview of our previous work, and finally outlines the future directions we consider important.

Introduction and Motivation

Large language models (LLMs) often represent concepts...