Reward Functions
• Applied to Introduction to Choice set Misspecification in Reward Inference by Rahul Chand 18d ago
• Applied to When is reward ever the optimization target? by Noosphere89 1mo ago
• Applied to Interpreting Preference Models w/ Sparse Autoencoders by Logan Riggs 5mo ago
• Applied to Reward hacking behavior can generalize across tasks by Kei 6mo ago
• Applied to Speedrun ruiner research idea by lemonhope 7mo ago
• Applied to Utility ≠ Reward by Oliver Sourbut 1y ago
• Applied to Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI by jacobjacob 1y ago
• Applied to VLM-RM: Specifying Rewards with Natural Language by ChengCheng 1y ago
• Applied to Some alignment ideas by SelonNerias 1y ago
• Applied to self-improvement-executors are not goal-maximizers by bhauth 1y ago
• Applied to Shutdown-Seeking AI by Simon Goldstein 1y ago
• Applied to Language Agents Reduce the Risk of Existential Catastrophe by cdkg 1y ago
• Applied to A Short Dialogue on the Meaning of Reward Functions by Leon Lang 2y ago
• Applied to Learning societal values from law as part of an AGI alignment strategy by John Nay 2y ago
• Applied to Scaling Laws for Reward Model Overoptimization by David Gross 2y ago
• Applied to Four usages of "loss" in AI by TurnTrout 2y ago
• Applied to Reward IS the Optimization Target by RobertM 2y ago