Thx for replying!
Let me see if I understood the assumptions correctly:
We have a way of keeping the AGI's evolution in check so that it arrives at near-human level but doesn't go far superhuman. For example, we limit the available RAM, there is a theorem that gives a lower bound on space complexity for a given level of intelligence (rigorously quantified in some way), and there is a way to measure human intelligence on the same scale. Alternatively, the amount of RAM doesn't give a strong enough bound by itself, but it does when combined with a limit on evolution time starting from the seed AI.
We are reasonably confident the AGI follows the utility function we program into it and this property is stable with respect to self-modification.
We are not reasonably confident that the utility function used in practice is actually friendly (although a serious attempt to make it friendly has been made).
We are reasonably confident in the ability to formally describe conditions such as "shutdown when button is pressed".
Is this about right?
Those points roughly describe the assumptions you'd have to make to think that the shutdown problem in particular (and "solutions" such as the one in the paper) were valid, I suppose. The study of corrigibility more broadly does not depend upon these assumptions necessarily, though -- the overall question that the study of corrigibility attempts to answer is this: given that your utility function probably won't be "friendly" on the first try, what sort of system do you want to build? As you've noticed, there are many open problems in th...
Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity Institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:
We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:
(See the paper for references.)
This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.
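For readers who want a concrete feel for the general idea behind utility indifference, here is a minimal toy sketch in Python. It is not the paper's formalism: the two actions, the press probabilities, the utility numbers, and all function names are hypothetical and chosen only for illustration. The assumption is that the agent picks one first action, the shutdown button is then pressed with some probability that depends on that action, and the agent's utility on the "pressed" branch gets a correction term equal to the expected normal utility minus the expected shutdown utility.

```python
# Toy illustration only. Every name and number below is a hypothetical
# assumption made for this sketch, not something taken from the paper.

ACTIONS = ("leave_button", "disable_button")

# Probability that the shutdown button ends up pressed after each action.
P_PRESS = {"leave_button": 0.5, "disable_button": 0.01}

# Expected "normal" utility if the button is NOT pressed.
# Disabling the button is assumed to cost a little effort.
E_NORMAL = {"leave_button": 10.0, "disable_button": 9.0}

# Expected "shutdown" utility if the button IS pressed.
E_SHUTDOWN = {"leave_button": 1.0, "disable_button": 1.0}


def naive_value(action):
    """Expected utility when the two utilities are simply switched on press."""
    p = P_PRESS[action]
    return (1 - p) * E_NORMAL[action] + p * E_SHUTDOWN[action]


def indifferent_value(action):
    """Expected utility with a correction term added on the pressed branch.

    The correction makes the pressed branch worth as much, in expectation,
    as the unpressed branch, so changing P_PRESS gains the agent nothing.
    """
    p = P_PRESS[action]
    correction = E_NORMAL[action] - E_SHUTDOWN[action]
    return (1 - p) * E_NORMAL[action] + p * (E_SHUTDOWN[action] + correction)


def best_action(value_fn):
    """Return the action that maximizes the given value function."""
    return max(ACTIONS, key=value_fn)


if __name__ == "__main__":
    print("naive agent picks:      ", best_action(naive_value))
    print("indifferent agent picks:", best_action(indifferent_value))
```

In this toy, the naive agent prefers to disable the button because a press would cost it normal utility, while the corrected agent's expected value no longer depends on the press probability, so the small cost of tampering makes it leave the button alone. The sketch says nothing about the potential concerns discussed in the paper.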