This felt weird to me, so I tried to construct a non-math example. Suppose we have a reward learning agent where we have designed the reward space so that "ask the human whether to do X" always has higher reward than "do X". The agent is now considering whether to ask the human to try heroin, or just give them heroin. If the agent gives them heroin, it will see their look of ecstasy and will update to have the reward function "5 for giving the human heroin, 7 for asking the human". If the agent asks the human, then the human w... (read more)
Yep, that's a key part of the problem. We want to designed the AI to update according to what the human says; but what the human says is not a variable out there in the world that the AI discovers, it's something the AI can rig or influence through its own actions.
This estimate depends on the agent's own actions (again, this is the heart of the problem).
This felt weird to me, so I tried to construct a non-math example. Suppose we have a reward learning agent where we have designed the reward space so that "ask the human whether to do X" always has higher reward than "do X". The agent is now considering whether to ask the human to try heroin, or just give them heroin. If the agent gives them heroin, it will see their look of ecstasy and will update to have the reward function "5 for giving the human heroin, 7 for asking the human". If the agent asks the human, then the human w... (read more)