When building an AGI, it's prudent to expect that you didn't get everything exactly right on the first try. Therefore, it's important to build systems that are amenable to modification, that don't tile the universe halfway through value loading, and so on. In other words, even if you could kick off an intelligence explosion that quickly produces a system over which you have no control, this is probably a bad plan unless you've first done a whole lot of hard work verifying the system and ensuring that it is aligned with your interests. Corrigibility is the study of reasoning systems that remain amenable to modification in the window between "starting to build models of its operators" and "it doesn't matter what you try to do anymore."
You might be right that that window would be small by default, but it's pretty important to make it as wide as possible in order to attain good outcomes.
Thanks for replying!
Let me see if I understood the assumptions correctly:
We have a way of keeping the AGI's evolution in check so that it arrives at near-human level but doesn't go far superhuman. For example, we limit the available RAM, there is a theorem that gives a space-complexity lower bound per given level of intelligence (rigorously quantified in some way), and there is a way to measure human intelligence on the same scale. Alternatively, the amount of RAM doesn't give a strong enough bound by itself, but it does combined with a limit on evolution.
Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity Institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:
We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:
(See the paper for references.)
This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.
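To give a rough flavor of the idea, here is a toy numerical sketch of utility indifference (all numbers and the two-branch setup are illustrative assumptions, not the paper's formalism): the agent's utility on the shutdown branch is compensated so that its expected utility no longer depends on the probability of the shutdown button being pressed, removing the incentive to manipulate that probability.

```python
# Toy sketch of utility indifference. The utilities, probabilities, and the
# two-branch model here are hypothetical, chosen only to illustrate the idea.

def expected_utility(p_press, u_normal, u_shutdown, compensation=0.0):
    """Expected utility when the shutdown button is pressed with prob p_press.

    On the shutdown branch the agent receives u_shutdown plus an optional
    compensation term; otherwise it receives u_normal.
    """
    return (1 - p_press) * u_normal + p_press * (u_shutdown + compensation)

U_NORMAL, U_SHUTDOWN = 10.0, 0.0

# Naive agent: shutdown loses utility, so it prefers an action that lowers
# the press probability (e.g. disabling the button: p_press drops to 0).
naive_keep = expected_utility(0.5, U_NORMAL, U_SHUTDOWN)      # 5.0
naive_disable = expected_utility(0.0, U_NORMAL, U_SHUTDOWN)   # 10.0
assert naive_disable > naive_keep  # incentive to manipulate the button

# Indifference: compensate the shutdown branch so both branches pay equally.
theta = U_NORMAL - U_SHUTDOWN
indiff_keep = expected_utility(0.5, U_NORMAL, U_SHUTDOWN, theta)     # 10.0
indiff_disable = expected_utility(0.0, U_NORMAL, U_SHUTDOWN, theta)  # 10.0
assert indiff_keep == indiff_disable  # no incentive to touch the button
```

Note that indifference is a double-edged property: the compensated agent also has no incentive to *preserve* the button, which is one of the concerns the paper discusses.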