When is reward ever the optimization target?

Noosphere89

37

[ Question ]

When is reward ever the optimization target?

by Noosphere89

15th Oct 2024

1 min read

4 17

37

Alright, I have a question stemming from TurnTrout's post on Reward is not the optimization target, where he argues that the premises that are required to get to the conclusion of reward being the optimization target are so narrowly applicable as to not apply to future RL AIs as they gain more and more power:

https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target#When_is_reward_the_optimization_target_of_the_agent_

But @gwern argued with Turntrout that reward is in fact the optimization target for a broad range of RL algorithms:

https://www.lesswrong.com/posts/ttmmKDTkzuum3fftG/#sdCdLw3ggRxYik385

https://www.lesswrong.com/posts/nmxzr2zsjNtjaHh7x/actually-othello-gpt-has-a-linear-emergent-world#Tdo7S62iaYwfBCFxL

So my question is are there known results, ideally proofs, but I can accept empirical studies if necessary that show when RL algorithms treat the reward function as an optimization target?

And how narrow is the space of RL algorithms that don't optimize for the reward function?

A good answer will link to results known in the RL literature that are relevant to the question, and give conditions under which a RL agent does or doesn't optimize the reward function.

The best answers will present either finite-time results on RL algorithms optimizing the reward function, or argue that the infinite limit abstraction is a reasonable approximation to the actual reality of RL algorithms.

I'd like to know which RL algorithms optimize the reward, and which do not.

^{^}

For more than you want to know about the various terminologies, see How sequential interactive processing within frontostriatal loops supports a continuum of habitual to controlled processing.

We debated the terminologies habitual/goal-directed, automatic and controlled, system 1/system 2, and model-free/model-based for years. All of them have limitations, and all of them mean slightly different things. In particular, model-based is vague terminology when systems get more complex than simple RL - but it is very clear that many complex human decisions (certainly ones in which we envision possible outcomes before taking actions) are far on the model-based side, and meet every definition.

^{^}

For more than you want to know about the various terminologies, see How sequential interactive processing within frontostriatal loops supports a continuum of habitual to controlled processing.

^{^}

One follow-on question is whether RL-based AGI will wirehead. I think this is almost the same question as getting into the experience box - except that that box will only keep going if the AGI engineers it correctly to keep going. So it's going to have to do a lot of planning before wireheading, unless its decision-making algorithm is highly biased toward near-term rewards over long-term ones. In the course of doing that planning, its other motivations will come into play - like the well-being of humans, if it cares about that. So whether or not our particular AGI will wirehead probably won't determine our fate.

Reward Functions2AI3