Meg — LessWrong

Meg

703000

Posts

Auditing language models for hidden objectives (145 karma, 11mo, 15 comments)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (310 karma, 2y, 95 comments)

Steering Llama-2 with contrastive activation additions (125 karma, 2y, 29 comments)

Towards Understanding Sycophancy in Language Models (66 karma, 2y, 0 comments)

Paper: LLMs trained on “A is B” fail to learn “B is A” (125 karma, 2y, 74 comments)

Paper: On measuring situational awareness in LLMs (109 karma, 2y, 17 comments)