Hi Nate, interesting work.
I don't understand the assumptions behind the corrigibility problem. According to the intelligence explosion thesis, a self-improving AI will spend very little time in the near-human intelligence interval. Thus, most of the time it will be either far subhuman or far superhuman. In the far subhuman region, the AI's manipulations against its programmers don't seem to be a concern. In the far superhuman region, fixing bugs seems to come way too late. In addition, it seems infeasible to debug the AI at this stage, since it would have rewritten its own source code into something humans probably cannot understand.
When building an AGI, it's quite prudent to expect that you didn't get everything exactly right on the first try. Therefore, it's important to build systems that are amenable to modification, that don't tile the universe halfway through value loading, etc. etc. In other words, even if you could kick off an intelligence explosion that quickly makes a system which you have no control over, this is probably a bad plan if you haven't done a whole hell of a lot of hard work verifying the system and ensuring that it is aligned with your interests first, and so o...
Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity Institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:
We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:
(See the paper for references.)
This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.
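For readers who have not seen utility indifference before, here is a minimal toy sketch of the basic idea. It is a one-shot simplification and not the paper's formalism: the function names, the two actions, and all the numbers are hypothetical and chosen only for illustration. The agent picks an action that determines the probability a shutdown button gets pressed. A naive expected-utility maximizer has an incentive to lower that probability, whereas adding a compensating term in the "pressed" branch makes the agent's expected utility independent of it.

```python
def expected_utility_naive(p_press, u_normal, u_shutdown):
    """Plain expected utility: the agent cares whether the button is pressed."""
    return (1 - p_press) * u_normal + p_press * u_shutdown


def expected_utility_indifferent(p_press, u_normal, u_shutdown):
    """Toy utility indifference: add a compensating term in the pressed branch
    so both branches are worth the same to the agent."""
    compensation = u_normal - u_shutdown
    return (1 - p_press) * u_normal + p_press * (u_shutdown + compensation)


# Hypothetical actions: (probability of button press, utility if not pressed,
# utility if pressed and the agent shuts down). All values are made up.
ACTIONS = {
    "work": (0.5, 10.0, 2.0),
    "disable_button": (0.0, 9.0, 2.0),  # costs a little, prevents shutdown
}


def best_action(score):
    """Return the action name with the highest score under `score`."""
    return max(ACTIONS, key=lambda name: score(*ACTIONS[name]))


if __name__ == "__main__":
    print("naive agent picks:", best_action(expected_utility_naive))
    print("indifferent agent picks:", best_action(expected_utility_indifferent))
```

In this toy setup the naive agent prefers to disable the button, while the compensated agent gains nothing from doing so. The sketch says nothing about the concerns raised in the paper, which arise in richer settings than this one.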