LESSWRONG
Sycophancy

This page is a stub.
Posts tagged Sycophancy
Sycophancy to subterfuge: Investigating reward tampering in large language models (Ω)
  Carson Denison, evhub; 11mo; karma 161; 22 comments; tag relevance 2
Steering Llama-2 with contrastive activation additions (Ω)
  Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub, TurnTrout; 1y; karma 125; 29 comments; tag relevance 2
Reducing sycophancy and improving honesty via activation steering (Ω)
  Nina Panickssery; 2y; karma 122; 18 comments; tag relevance 1
SAE features for refusal and sycophancy steering vectors (Ω)
  neverix, Dmitrii Kharlapenko, Arthur Conmy, Neel Nanda; 7mo; karma 29; 4 comments; tag relevance 1
Two new datasets for evaluating political sycophancy in LLMs (Ω)
  alma.liezenga; 7mo; karma 9; 0 comments; tag relevance 1
Towards a Science of Evals for Sycophancy
  andrejfsantos; 3mo; karma 6; 0 comments; tag relevance 1
Disproving the "People-Pleasing" Hypothesis for AI Self-Reports of Experience
  rife; 3mo; karma 3; 18 comments; tag relevance 1
Evaluating LLaMA 3 for political sycophancy
  alma.liezenga; 7mo; karma 2; 2 comments; tag relevance 1
Antagonistic AI
  Xybermancer; 1y; karma -8; 1 comment; tag relevance 1