LESSWRONG
Meg
Posts
Sorted by New
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Ω, 305 karma, 10mo, 95 comments)
Steering Llama-2 with contrastive activation additions (Ω, 123 karma, 11mo, 29 comments)
Towards Understanding Sycophancy in Language Models (Ω, 66 karma, 1y, 0 comments)
Paper: LLMs trained on “A is B” fail to learn “B is A” (Ω, 120 karma, 1y, 74 comments)
Paper: On measuring situational awareness in LLMs (Ω, 108 karma, 1y, 16 comments)
Wiki Contributions
Comments
Sorted by Newest