Meg
Posts
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Ω, 291 karma, 4mo, 94 comments)
Steering Llama-2 with contrastive activation additions (Ω, 120 karma, 4mo, 29 comments)
Towards Understanding Sycophancy in Language Models (Ω, 66 karma, 6mo, 0 comments)
Paper: LLMs trained on “A is B” fail to learn “B is A” (Ω, 120 karma, 7mo, 73 comments)
Paper: On measuring situational awareness in LLMs (Ω, 106 karma, 8mo, 16 comments)
Wiki Contributions
Comments