LESSWRONG
Sycophancy
• Applied to "SAE features for refusal and sycophancy steering vectors" by neverix, 1mo ago
• Applied to "Evaluating LLaMA 3 for political sycophancy" by alma.liezenga, 2mo ago
• Applied to "Two new datasets for evaluating political sycophancy in LLMs" by alma.liezenga, 2mo ago
• Applied to "Sycophancy to subterfuge: Investigating reward tampering in large language models" by Raemon, 5mo ago
• Applied to "Antagonistic AI" by Xybermancer, 9mo ago
• Applied to "Steering Llama-2 with contrastive activation additions" by TurnTrout, 11mo ago
• Applied to "Reducing sycophancy and improving honesty via activation steering" by Maxime Riché, 11mo ago
• Created by Maxime Riché, 11mo ago