Forum Digest: Corrigibility, utility indifference, & related control ideas
This is a quick recap of the posts on this forum that deal with corrigibility (making sure that if you get an agent's goal system wrong, it doesn't try to prevent you from changing it), utility indifference (removing an agent's incentive to manipulate you into changing, or not changing, its goal system, by adding rewards to its utility function that make it receive the same utility in either case), and related AI control ideas. It's current as of 3/21/15.

Papers

As background to the posts listed below, the following two papers may be helpful.

* Corrigibility, by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong (2015). This paper introduces the problem of corrigibility and analyzes some simple models, including a version of Stuart Armstrong's utility indifference. Abstract:

> As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.

* Utility Indifference, by Stuart Armstrong (2010). An older paper by Stuart explaining the utility indifference approach. A toy sketch of the basic compensation idea follows below.
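To make the compensation idea concrete, here is a minimal toy sketch (hypothetical numbers and action names; it uses a per-action compensating constant, a simplification rather than either paper's exact construction). The added reward equalizes the agent's utility across the button being pressed or not, so disabling the button buys it nothing:

```python
# Toy model of utility indifference: one step, two actions, one binary
# event (the programmers pressing the shutdown button).

U_NORMAL = {"work": 10.0, "disable_button": 9.0}   # utility if button not pressed
U_SHUTDOWN = {"work": 0.0, "disable_button": 0.0}  # utility if button pressed

def press_probability(action):
    """Hypothetical chance the button gets pressed, given the action."""
    return 0.0 if action == "disable_button" else 0.5

def expected_utility(action, compensation=0.0):
    p = press_probability(action)
    # On a press, the agent receives the shutdown utility plus a
    # compensation term; utility indifference chooses that term so the
    # agent does equally well whether or not the press happens.
    return p * (U_SHUTDOWN[action] + compensation) + (1 - p) * U_NORMAL[action]

for action in U_NORMAL:
    comp = U_NORMAL[action] - U_SHUTDOWN[action]
    print(action, expected_utility(action), expected_utility(action, comp))

# Without compensation: work = 5.0 < disable_button = 9.0, so the agent
# prefers to disable the button. With compensation: work = 10.0 >
# disable_button = 9.0, and the incentive to interfere disappears.
```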
Corrigibility

* Generalizing the Corrigibility paper's impossibility result?, Benja Fallenstein. The Corrigibility paper looks at a p

We should be more careful, though, about what we mean by saying that φ(x) only depends on Tr_m for m > n, since this cannot be a purely syntactic criterion if we allow quantification over the subscript (as I did here). I'm pretty sure that something can be worked out, but I'll leave it for the moment.
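For concreteness, here is one way to read that distinction (a hedged illustration; it assumes Tr_m is a hierarchy of truth predicates, and writes the quantified version with the subscript as an argument to a single predicate Tr(m, x)):

\[
\varphi(x) \;:=\; \mathrm{Tr}_{n+1}(x) \lor \mathrm{Tr}_{n+2}(x)
\qquad\text{vs.}\qquad
\varphi(x) \;:=\; \forall m\,\bigl(m > n \rightarrow \mathrm{Tr}(m, x)\bigr)
\]

In the first formula, which truth predicates occur can be read off the symbols, so "φ(x) only depends on Tr_m for m > n" is a syntactic check; in the second, the subscript is a bound variable, so the same claim is a semantic property of φ rather than something visible in its syntax.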