Transformer Circuits
• Applied to Are SAE features from the Base Model still meaningful to LLaVA? by Shan23Chen, 18d ago
• Applied to Concrete Methods for Heuristic Estimation on Neural Networks by Oliver Daniels, 1mo ago
• Applied to Open Source Replication of Anthropic’s Crosscoder paper for model-diffing by Connor Kissane, 2mo ago
• Applied to Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models? by Taras Kutsyk, 3mo ago
• Applied to SAEs (usually) Transfer Between Base and Chat Models by Connor Kissane, 5mo ago
• Applied to Arrakis - A toolkit to conduct, track and visualize mechanistic interpretability experiments. by Yash Srivastava, 5mo ago
• Applied to An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 by Neel Nanda, 6mo ago
• Applied to Logit Prisms: Decomposing Transformer Outputs for Mechanistic Interpretability by ntt123, 6mo ago
• Applied to "What the hell is a representation, anyway?" | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents by IwanWilliams, 6mo ago
• Applied to Finding Backward Chaining Circuits in Transformers Trained on Tree Search by abhayesian, 7mo ago
• Applied to Can quantised autoencoders find and interpret circuits in language models? by charlieoneill, 9mo ago
• Applied to Sparse Autoencoders Work on Attention Layer Outputs by robertzk, 1y ago
• Applied to Finding Sparse Linear Connections between Features in LLMs by Logan Riggs, 1y ago
• Applied to AISC project: TinyEvals by Jett Janiak, 1y ago
• Applied to Polysemantic Attention Head in a 4-Layer Transformer by Jett Janiak, 1y ago
• Applied to Graphical tensor notation for interpretability by Jordan Taylor, 1y ago
• Applied to Interpreting OpenAI's Whisper by Neel Nanda, 1y ago
• Applied to Automatically finding feature vectors in the OV circuits of Transformers without using probing by Jacob Dunefsky, 1y ago
• Applied to An adversarial example for Direct Logit Attribution: memory management in gelu-4l by Can, 1y ago