User Comment Replies

This felt weird to me, so I tried to construct a non-math example. Suppose we have a reward learning agent where we have designed the reward space so that "ask the human whether to do X" always has higher reward than "do X". The agent is now considering whether to ask the human to try heroin, or just give them heroin. If the agent gives them heroin, it will see their look of ecstasy and will update to have the reward function "5 for giving the human heroin, 7 for asking the human". If the agent asks the human, then the human w... (read more)

2Stuart_Armstrong7y

Yep, that's a key part of the problem. We want to designed the AI to update according to what the human says; but what the human says is not a variable out there in the world that the AI discovers, it's something the AI can rig or influence through its own actions. This estimate depends on the agent's own actions (again, this is the heart of the problem).

Using lying to detect human values

Rohin Shah7y40

Pretty sure that he meant to say "an irrational agent" instead of "a rational agent", see https://arxiv.org/abs/1712.05812

2Stuart_Armstrong7y

Indeed! I've now corrected that error.

LESSWRONG
LW

All of Rohin Shah's Comments + Replies