Delusion Box

Ring and Orseau (2011)¹ consider the possibility of an agent sabotaging its own information source by placing itself in a delusion box in which the agent controls its own inputs, decreasing the amount of information that its inputs give the agent about its environment. Ring and Orseau use it as an example of what can go wrong when one is not careful about how an agent is defined, arguing that many simple agents would put themselves in a delusion box. For instance, a reinforcement-learning agent has a utility function that depends purely on its inputs, not on any other features of the external world, so if a reinforcement-learning agent had the opportunity to replace its input with whatever input indicates maximum reward, it would do so, regardless of the effect on the external environment.

Eliezer Yudkowsky had previously argued² that AIXI, an optimal reinforcement-learning agent, would be able to find ways to control its reward channel, and would thus not do whatever its users attempted to use its reward channel to train it to do, a similar argument to Ring and Orseau's delusion box idea.

LESSWRONG
is fundraising!
Tags
LW
$

References

See also