Reward Functions
Posts tagged Reward Functions, most relevant first
Reward is not the optimization target (Ω) · TurnTrout · 2y · relevance 5 · karma 376 · 123 comments
Draft papers for REALab and Decoupled Approval on tampering (Ω) · Jonathan Uesato, Ramana Kumar · 4y · relevance 5 · karma 47 · 2 comments
Scaling Laws for Reward Model Overoptimization (Ω) · leogao, John Schulman, Jacob_Hilton · 2y · relevance 2 · karma 103 · 13 comments
Seriously, what goes wrong with "reward the agent when it makes you smile"? (Q, Ω) · TurnTrout, johnswentworth · 2y · relevance 2 · karma 87 · 42 comments
Interpreting Preference Models w/ Sparse Autoencoders (Ω) · Logan Riggs, Jannik Brinkmann · 5mo · relevance 2 · karma 74 · 12 comments
Four usages of "loss" in AI (Ω) · TurnTrout · 2y · relevance 2 · karma 46 · 18 comments
Intrinsic Drives and Extrinsic Misuse: Two Intertwined Risks of AI · jsteinhardt · 1y · relevance 2 · karma 40 · 0 comments
Language Agents Reduce the Risk of Existential Catastrophe (Ω) · cdkg, Simon Goldstein · 1y · relevance 2 · karma 39 · 14 comments
When is reward ever the optimization target? (Q) · Noosphere89 · 1mo · relevance 2 · karma 33 · 12 comments
$100/$50 rewards for good references (Ω) · Stuart_Armstrong · 3y · relevance 2 · karma 20 · 5 comments
Why we want unbiased learning processes · Stuart_Armstrong · 7y · relevance 2 · karma 13 · 3 comments
Learning societal values from law as part of an AGI alignment strategy · John Nay · 2y · relevance 2 · karma 5 · 18 comments
Utility ≠ Reward (Ω) · Vlad Mikulik · 5y · relevance 1 · karma 130 · 24 comments
Reward hacking behavior can generalize across tasks (Ω) · Kei, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison, Ethan Perez · 6mo · relevance 1 · karma 78 · 5 comments
Shutdown-Seeking AI (Ω) · Simon Goldstein · 1y · relevance 1 · karma 50 · 32 comments