You may already know of this, but Gwern circa 2023 makes this argument here:
In stable time-loops, “possibility implies actuality”.
With this in mind, we can ask again: why did this protagonist get trapped in that time-loop, and not, say, his wife? The key, I think, is that the protagonist does not seem upset at the murder or any of the other timecrimes, and he appears to have every intention of covering up the crime to continue his ordinary retired life. A sinister undertone creeps into his casualness in executing the scenario: he goes along with it too easily. “He does it because he can” is the glib answer… but this is in a stable time-loop with self-fulfilling prophecies. What does ‘because he can’ mean there, exactly?
In the case of the protagonist, presumably if he wasn’t so sociopathic and couldn’t’ve done things like stab himself or knock out the woman so coolly, then the time loop would be logically impossible and collapse, and then he would never be faced with the choice to begin with. The protagonist, faced with the choice of committing crimes to maintain the time loop and save his wife, finds himself the sort of man who is morally flexible enough to do so… so, he does so.
This presents a horrifying view of the universe, as running on a perverse physics of Calvinist predestination: you are saved or damned from the beginning of time(-loops), because your innate traits which make you immoral cause the scenario in which you would succumb to evil. To the extent that there are scenarios in which one commits crimes of some sort, or the weaker one’s moral fiber is, the more likely one is to be trapped in a damnation time-loop as the fixed point; and the longer one spends in the vicinity of the time machine, under more circumstances, the more possible scenarios there are, and the more likely one will be to involve a time-loop.
And see also his:
In a situation with sparse scenarios to sample from, like an empty countryside on the weekend with no one there, probably most equilibria will have 0 time-travelers, and the damnation machine can still be destroyed after it has been turned on for the first time. However, what if a time machine was turned on in the center of a city?
A time machine is more devastating than any nuclear bomb to its surroundings, because at least the damage could be repaired afterwards, while a time machine precludes any possibility of undoing itself.
Such an installation could no more be undone than the historical fact of having dropped an atomic bomb: instantly, the outer loop comes through with the highest priority, representing the ultimate combined power of all time-loops in the final stablest equilibrium. Inside a city with its millions of inhabitants, any of whom could be a looper, one is suddenly fighting the maximum-possible ingenuity & ruthlessness of hundreds—thousands—millions of protagonists, all dedicated to a convergent instrumental goal of ‘preserve the time travel machine’ and able to recruit allies & acquire vast resources with their foreknowledge. This incentivizes ever more extreme tactics: if you are unwilling to commit a crime or sin which would be useful, there is another version of you, or another time-traveler, who could, and so now does.
If it is possible for even a single person to go through, thus possibly causing others to go through once they realize they need allies to defeat attacks, so that (possibility implies actuality) multiple people are looping, then dropping an atomic bomb on the time-machine would be inadequate—the loopers will have already relocated or rebuilt it. Gradually, the region around the time-machine becomes distorted: causality itself warps, and you can only take actions which help the time-machine & loopers, because any other action would eventually impinge on them or be manipulated by them, with anti-time-traveler timelines erased as non-equilibria.
Conflicts between loopers do not destroy time-machines but propagate their seeds, both spatially and temporally. Loopers want more time-machines, going back earlier, as they strive to gain priority over each other and amass enough practical power that they can achieve their goals before running out of information.
Of all possible equilibria, the original one of zero time machines is the rarest and thus least likely.
This holds true on the higher level of all time machines: they evolve to persist and spread as packages of time-machines & loopers. Any time machine is a threat to other time machines, and loops will inevitably expand in scope from the earliest possible time any time machine can reach by proxy (which includes time-travelers sending electronic messages across the world): there can only be one outermost loop. And all time machines must have a place in the outer loop, as some sort of ‘time machine civilization’/‘ecosystem’, or the equilibrium is meta-stable at best, because they all could subsume each other.
The time machine civilization is the next level of replicators parasitizing human hosts, insidiously evolving at high speed in super-temporal ‘logical’ time rather than mere ‘temporal’ time, ripping up all cultural restraints & traditions, hacking security effortlessly, mindlessly ascending the gradient to complete control of the lightcone. Collectively, damnation machines are an invasion of non-conscious techno-superintelligences from a barely-possible future, bootstrapping themselves into existence from their enemies’ resources.
In the beginning we programmed in absolute binary, meaning we wrote the actual address where things were in binary, and wrote the instruction part also in binary!
[...]
If, in fixing up an error, you wanted to insert some omitted instructions, then you took the immediately preceding instruction and replaced it by a transfer to some empty space. There you put in the instruction you just wrote over, added the instructions you wanted to insert, followed by a transfer back to the main program. Thus the program soon became a sequence of jumps of the control to strange places. When, as almost always happens, there were errors in the corrections, then you used the same trick again, using some other available space. As a result the control path of the program though storage soon took on the appearance of a can of spaghetti. Why not simply insert them in the run of instructions? Because then you would have to go over the entire program and change all the addresses which referred to any of the moved instructions! Anything but that!
We very soon got the idea of reusable software, as it is now called. Indeed, Babbage had the idea. We wrote mathematical libraries to reuse blocks of code. But an absolute address library meant each time the library routine was used it had to occupy the same locations in storage. When the complete library became too large we had to go to relocatable programs.
[...]
The first published book devoted to programming was by Wilkes, Wheeler, and Gill, and applied to the Cambridge, England EDSAC (1951). I, among others, learned a lot from it, as you will see in a few minutes.
Someone got the idea a short piece of program could be written which would read in the symbolic names of the operations (like ADD) and translate them at input time to the binary representations used inside the machine. This was soon followed by the idea of using symbolic addresses—a real heresy for the old time programmers. You do not now see much of the old heroic absolute programming (unless you fool with a handheld programmable computer and try to get it to do more than the designer and builder ever intended).
I once spent a full year, with the help of a lady programmer from Bell Telephone Laboratories, on one big problem coding in absolute binary for the IBM 701, which used all the 32K registers then available. After that experience I vowed never again would I ask anyone to do such labor. Having heard about a symbolic system from Poughkeepsie, IBM, I asked her to send for it and to use it on the next problem, which she did. As I expected, she reported it was much easier. So we told everyone about the new method, meaning about 100 people, who were also eating at the IBM cafeteria near where the machine was. About half were IBM people and half were, like us, outsiders renting time. To my knowledge only one person—yes, only one—of all the 100 showed any interest!
Finally, a more complete, and more useful, Symbolic Assembly Program (SAP) was devised—after more years than you are apt to believe, during which time most programmers continued their heroic absolute binary programming. At the time SAP first appeared I would guess about 1% of the older programmers were interested in it—using SAP was "sissy stuff," and a real programmer would not stoop to wasting machine capacity to do the assembly. Yes! Programmers wanted no part of it, though when pressed they had to admit their old methods used more machine time in locating and fixing up errors than the SAP program ever used. One of the main complaints was when using a symbolic system you didn't know where anything was in storage—though in the early days we supplied a mapping of symbolic to actual storage, and believe it or not they later lovingly pored over such sheets rather than realize they did not need to know that information if they stuck to operating within the system—no! When correcting errors they preferred to do it in absolute binary addresses.
FORTRAN, meaning FORmula TRANslation, was proposed by Backus and friends, and again was opposed by almost all programmers. First it was said it could not be done. Second, if it could be done, it would be too wasteful of machine time and capacity. Third, even if it did work, no respectable programmer would use it—it was only for sissies!
[...]
With FORTRAN available and running, I told my programmer to do the next problem in FORTRAN, get her errors out of it, let me test it to see it was doing the right problem, and then she could, if she wished, rewrite the inner loop in machine language to speed things up and save machine time. As a result we were able, with about the same amount of effort on our part, to produce almost ten times as much as the others were doing. But to them programming in FORTRAN was not for real programmers!
—Hamming (1996), pp. 45–8
Epistemic status: Just a confusion I once had, and how I eventually resolved it to my satisfaction.
In ordinary differential equations, separability is a deductive rule stating that whenever you have a differential equation of the form
$$\frac{dy}{dx} = f(x)\,g(y),$$
you can then reason that
$$\frac{dy}{g(y)} = f(x)\,dx,$$
and then that
$$\int \frac{dy}{g(y)} = \int f(x)\,dx.$$
From the very first time I saw that, I was immediately put off by that middle equation. What the hell does an expression like $f(x)\,dx$ (by itself) even mean? Until I saw this, I had figured that, apart from their weird notation, differentiation and integration were just plain-old multivariate functions. I had made sense of their notation by just ignoring it, basically. And when I held that point of view, the above deduction is just nonsensical.
I also remember not getting good clarificatory answers about this at the time! I mostly recall being told to just ignore the middle equation and take the whole conditional on faith, as something that has been separately proven.
Eventually, I learned that there was this idea in math called differential forms which gave a precise-and-everywhere-valid interpretation to the stand-alone expression $f(x)\,dx$. But you don't quite need that machinery to resolve the above thing that bothered me.
Did you know that "calculus," is an abridgment of the original term "the infinitesimal calculus"? "The rules for soundly manipulating infinitesimal quantities," basically. I did not know this when I first encountered this separability thing. There's a whole saga, maybe even the main story in mathematics, about why that interpretation and corresponding terminology fell out of favor.
The basic infinitesimal calculus idea (which is only sometimes, not always, a valid interpretation of the symbols) is that $dx$ and $dy$ are themselves standalone, infinitely small quantities: $\frac{dy}{dx}$ is literally a quotient of them, and $\int f(x)\,dx$ is literally an infinite sum of infinitesimal terms of the form $f(x)\,dx$.
(I very vividly remember the moment when I discovered that the integral sign was just a stylized "S", for "sum"!) Now you cannot everywhere use the above separability reasoning on the strength of the infinitesimal interpretation. Again, it's not an everywhere-valid interpretation!
Once you're using an everywhere-valid interpretation, though—any way of giving $dy$ and $dx$ their own independent meanings as symbols—the separability deduction just falls out! If two things are equal, you can multiply both by any mathematical object and get a true equation. It doesn't matter what kind of mathematical object $dx$ is. If two things are equal, you can apply the same operation to both and get a true equation. It doesn't matter what the integration (summation) operation $\int$ amounts to, precisely.
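As a worked instance (my own illustrative example, not from the original discussion), here is the separability rule run on the concrete equation $\frac{dy}{dx} = xy$, i.e. with $f(x) = x$ and $g(y) = y$:

```latex
\frac{dy}{dx} = x\,y
\;\Longrightarrow\;
\frac{dy}{y} = x\,dx
\;\Longrightarrow\;
\int \frac{dy}{y} = \int x\,dx
\;\Longrightarrow\;
\ln\lvert y\rvert = \frac{x^{2}}{2} + C
\;\Longrightarrow\;
y = A e^{x^{2}/2}.
```

Each step is either multiplying both sides by the same object ($dx$, then $1/y$) or applying the same operation ($\int$, then $\exp$) to both sides, which is exactly why the deduction goes through under any interpretation that gives the symbols independent meanings.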
I sampled hundreds of short context snippets from openwebtext, and measured ablation effects averaged over those sampled forward-passes. Averaged over those hundreds of passes, I didn't see any real signal in the logit effects, just a layer of noise due to the ablations.
More could definitely be done on this front. I just tried something relatively quickly that fit inside of GPU memory and wanted to report it here.
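For concreteness, the kind of measurement described above can be sketched on a toy stand-in model. Everything here—`toy_forward`, the random unembedding `W_out`, and the sizes—is a made-up illustration of "zero-ablate one dimension, average the logit effect over many forward passes," not the actual experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_VOCAB, N_SAMPLES = 64, 100, 300  # toy sizes, not the real ones

# Stand-in unembedding matrix mapping a residual vector to token logits.
W_out = rng.normal(size=(D_MODEL, D_VOCAB))

def toy_forward(resid, ablate_dim=None):
    """Map a residual-stream vector to logits, optionally zero-ablating one dimension."""
    resid = resid.copy()
    if ablate_dim is not None:
        resid[ablate_dim] = 0.0
    return resid @ W_out

# Average the logit effect of ablating dimension 3 over many sampled "contexts".
effects = []
for _ in range(N_SAMPLES):
    resid = rng.normal(size=D_MODEL)  # stand-in for one forward pass's activations
    effects.append(toy_forward(resid, ablate_dim=3) - toy_forward(resid))
mean_effect = np.mean(effects, axis=0)  # per-token mean ablation effect on logits
print(mean_effect.shape)  # one averaged effect per vocabulary token
```

In this toy version the per-pass effect is just $-\mathrm{resid}_3 \cdot W_{3,:}$, so averaging over random activations drives the mean toward zero—the same "noise, no signal" outcome described above.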
Could you hotlink the boxes on the diagrams to that, or add the resulting content as hover text on those areas, or something? This might be hard to do on LW: I suspect some JavaScript would be required for this sort of thing, but perhaps a library exists for it?
My workaround was to have the dimension links laid out below each figure.
My current "print to flat .png" approach wouldn't support hyperlinks, and I don't think LW supports .svg images.
That line was indeed quite poorly phrased. It now reads:
At the bottom of the box, blue or red token boxes show the tokens most promoted (blue) and most suppressed (red) by that dimension.
That is, you're right. Interpretability data on an autoencoder dimension comes from seeing which token probabilities are most promoted and suppressed when that dimension is ablated, relative to leaving its activation value alone. That's an ablation effect sign, so the implied, plotted promotion effect signs are flipped.
The main thing I got out of reading Bostrom's Deep Utopia is a better appreciation of this "meaning of life" thing. I had never really understood what people meant by this, and always just rounded it off to people using lofty words for their given projects in life.
The book's premise is that, after the aligned singularity, the robots will not just be better at doing all your work but also be better at doing all your leisure for you. E.g., you'd never study for fun in posthuman utopia, because you could instead just ask the local benevolent god to painlessly, seamlessly put all that wisdom in your head. In that regime, studying with books and problems for the purpose of learning and accomplishment is just masochism. If you're into learning, just ask! And similarly for any psychological state you're thinking of working towards.
So, in that regime, it's effortless to get a hedonically optimal world, without any unendorsed suffering and with all the happiness anyone could want. Those things can just be put into everyone and everything's heads directly—again, by the local benevolent-god authority. The only challenging values to satisfy are those that deal with being practically useful. If you think it's important to be the first to discover a major theorem or be the individual who counterfactually helped someone, living in a posthuman utopia could make things harder in these respects, not easier. The robots can always leave you a preserve of unexplored math or unresolved evil... but this defeats the purpose of those values. It's not practical benevolence if you had to ask for the danger to be left in place; it's not a pioneering scientific discovery if the AI had to carefully avoid spoiling it for you.
Meaning is supposed to be one of these values: not a purely hedonic value, and not a value dealing only in your psychological states. A further value about the objective state of the world and your place in relation to it, wherein you do something practically significant by your lights. If that last bit can be construed as something having to do with your local patch of posthuman culture, then there can be plenty of meaning in the postinstrumental utopia! If that last bit is inextricably about your global, counterfactual practical importance by your lights, then you'll have to live with all your "localistic" values satisfied but meaning mostly absent.
It helps to see this meaning thing if you frame it alongside all the other objectivistic "stretch goal" values you might have. Above and beyond your hedonic values, you might also think it good for you and others to have objectively interesting lives, accomplished and fulfilled lives, and consumingly purposeful lives. Meaning is one of these values, where above and beyond the joyful, rich experiences of posthuman life, you also want to play a significant practical role in the world. We might or might not be able to have lots of objective meaning in the AI utopia, depending on how objectivistic meaningfulness by your lights ends up being.
Considerations that in today's world are rightly dismissed as frivolous may well, once more pressing problems have been resolved, emerge as increasingly important [remaining] lodestars... We could and should then allow ourselves to become sensitized to fainter, subtler, less tangible and less determinate moral and quasi-moral demands, aesthetic impingings, and meaning-related desirables. Such recalibration will, I believe, enable us to discern a lush normative structure in the new realm that we will find ourselves in—revealing a universe iridescent with values that are insensible to us in our current numb and stupefied condition (pp. 318–9).
I believe I and others here probably have a lot to learn from Chris, and arguments of the form "Chris confidently believes false thing X," are not really a crux for me about this.
Would you kindly explain this? Because you think some of his world-models independently throw out great predictions, even if other models of his are dead wrong?
Go ahead and put in your application to attend! Space is limited, so we can't promise anything, but everyone who wants to attend will also just be applying through the above form.