Sycophancy
This page is a stub.
Posts tagged Sycophancy (sorted by most relevant)
Sycophancy to subterfuge: Investigating reward tampering in large language models [Ω], by Carson Denison, evhub (9mo). Relevance 2, karma 161, 22 comments.
Steering Llama-2 with contrastive activation additions [Ω], by Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub, TurnTrout (1y). Relevance 2, karma 125, 29 comments.
Reducing sycophancy and improving honesty via activation steering [Ω], by Nina Panickssery (2y). Relevance 1, karma 122, 18 comments.
SAE features for refusal and sycophancy steering vectors [Ω], by neverix, Dmitrii Kharlapenko, Arthur Conmy, Neel Nanda (5mo). Relevance 1, karma 29, 4 comments.
Two new datasets for evaluating political sycophancy in LLMs [Ω], by alma.liezenga (6mo). Relevance 1, karma 9, 0 comments.
Towards a Science of Evals for Sycophancy, by andrejfsantos (2mo). Relevance 1, karma 6, 0 comments.
Disproving the "People-Pleasing" Hypothesis for AI Self-Reports of Experience, by rife (2mo). Relevance 1, karma 3, 18 comments.
Evaluating LLaMA 3 for political sycophancy, by alma.liezenga (6mo). Relevance 1, karma 2, 2 comments.
Antagonistic AI, by Xybermancer (1y). Relevance 1, karma -8, 1 comment.