Meg
Posts
Sorted by New
137 karma · Auditing language models for hidden objectives · Ω · 12d · 7 comments
305 karma · Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · Ω · 1y · 95 comments
125 karma · Steering Llama-2 with contrastive activation additions · Ω · 1y · 29 comments
66 karma · Towards Understanding Sycophancy in Language Models · Ω · 1y · 0 comments
121 karma · Paper: LLMs trained on “A is B” fail to learn “B is A” · Ω · 2y · 74 comments
109 karma · Paper: On measuring situational awareness in LLMs · Ω · 2y · 16 comments