This felt weird to me, so I tried to construct a non-math example. Suppose we have a reward learning agent where we have designed the reward space so that "ask the human whether to do X" always has higher reward than "do X". The agent is now considering whether to ask the human to try heroin, or just give them heroin. If the agent gives them heroin, it will see their look of ecstasy and update to the reward function "5 for giving the human heroin, 7 for asking the human". If the agent asks, the human will say "no", and it will update to the reward function "-1 for giving the human heroin, 1 for asking the human". Under both updated reward functions, asking is the optimal action; yet the agent evaluates each action under the reward function it anticipates holding afterwards, so it compares 5 (for giving heroin) against 1 (for asking), and ends up giving the human heroin without asking.
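To make the decision rule concrete, here's a minimal sketch of the anticipated-update agent, using the reward numbers from the example above (the dictionary layout is just illustrative):

```python
# Each action leads to a known update: updated_reward[a] is the reward
# function the agent anticipates holding after taking action a.
# Numbers are taken from the heroin example above.
actions = ["give_heroin", "ask_human"]

updated_reward = {
    "give_heroin": {"give_heroin": 5, "ask_human": 7},   # sees ecstasy
    "ask_human":   {"give_heroin": -1, "ask_human": 1},  # human says no
}

def anticipated_value(a):
    # The agent scores action a under the reward function it will
    # hold *after* taking a -- this is the step that feels wrong.
    return updated_reward[a][a]

best = max(actions, key=anticipated_value)
print(best)  # give_heroin: 5 beats 1, even though both updated
             # reward functions rank asking above giving heroin
```

Note that under either row of `updated_reward`, `ask_human` scores higher than `give_heroin`; the pathology comes entirely from each action being scored under a different reward function.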
This seems isomorphic to the example you gave, and it's a little clearer what I find weird:
The agent _knows_ how it's going to update based on the action it takes. This feels wrong to me, though I think the conclusion holds even if the agent only has probabilistic beliefs about how it will update.
Can we simply make sure that the agent selects its action according to its current estimate of the reward function (or the mean reward function, if it has a probability distribution), and only updates after seeing the result of the action? This avoids the problem above, as well as the problem in Towards Interactive Inverse Reinforcement Learning, and seems to be the approach taken in Deep Reinforcement Learning from Human Preferences. (Such an agent could be a biased learning process, and still be safe.)
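The proposed fix can be sketched the same way. Here the current reward estimate is a hypothetical one that respects the designed constraint that asking beats doing; the point is only that action selection uses this single, current estimate:

```python
# Sketch of the fix: choose under the *current* reward estimate,
# and update only after observing the result of the action.
# The estimate below is hypothetical, chosen to satisfy the designed
# constraint reward("ask about X") > reward("do X").
current_reward = {"give_heroin": 0, "ask_human": 1}

chosen = max(current_reward, key=current_reward.get)
print(chosen)  # ask_human: asking wins under the current estimate

# Only now would the agent act, observe the human's "no", and update
# its estimate -- the anticipated update never enters action selection.
```

Because both the heroin action and the ask action are scored by the same reward function, the designed "asking always wins" constraint actually binds.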