LESSWRONG

Meg
695000

Posts

Auditing language models for hidden objectives (141 karma, Ω, 4mo, 15 comments)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (305 karma, Ω, 1y, 95 comments)
Steering Llama-2 with contrastive activation additions (125 karma, Ω, 2y, 29 comments)
Towards Understanding Sycophancy in Language Models (66 karma, Ω, 2y, 0 comments)
Paper: LLMs trained on “A is B” fail to learn “B is A” (121 karma, Ω, 2y, 74 comments)
Paper: On measuring situational awareness in LLMs (109 karma, Ω, 2y, 17 comments)

Wikitag Contributions

No wikitag contributions to display.

Comments

No Comments Found