Activation Engineering
• Applied to "A Sober Look at Steering Vectors for LLMs" by Joschka Braun, 9d ago
• Applied to "Avoiding jailbreaks by discouraging their representation in activation space" by Guido Bergman, 2mo ago
• Applied to "Validating / finding alignment-relevant concepts using neural data" by Bogdan Ionut Cirstea, 2mo ago
• Applied to "[Paper] Programming Refusal with Conditional Activation Steering" by Bruce W. Lee, 2mo ago
• Applied to "Activation Engineering Theories of Impact" by kubanetics, 4mo ago
• Applied to "I found >800 orthogonal "write code" steering vectors" by Jacob G-W, 4mo ago
• Applied to "An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs" by Jan Wehner, 4mo ago
• Applied to "Control Vectors as Dispositional Traits" by Jan Wehner, 5mo ago
• Applied to "Representation Tuning" by Jan Wehner, 5mo ago
• Applied to "LLMs Universally Learn a Feature Representing Token Frequency / Rarity" by Sean Osier, 5mo ago
• Applied to "Jailbreak steering generalization" by Nina Panickssery, 5mo ago
• Applied to "Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller" by Henry Cai, 5mo ago
• Applied to "Introducing SARA: a new activation steering technique" by Alejandro Tlaie, 6mo ago
• Applied to "Mechanistically Eliciting Latent Behaviors in Language Models" by TurnTrout, 7mo ago
• Applied to "How well do truth probes generalise?" by mishajw, 9mo ago