I think there is an issue with the utility indifference framework that has not been mentioned in the paper and in the comments so far:
If the agent is able to affect with its action a1 the probability of the shutdown button being pressed, that is, a1 can be a manipulative action, and if the agent is indifferent to the button being pressed, then it may happen that it "accidentally" performs a manipulative action.
E.g. assume that the users (the programmers) would use a remote controlled robotic arm to press the shutdown button. If the agents turns out to be a paperclipper, it may disassemble the robotic arm just to turn it into paperclips. The agent is not "intentionally" trying to resist shutdown, but the effect will be the same. Symmetrically there could be scenarios where the agent "accidentally" presses the shutdown button itself.
If I understand correctly, UN is already supposed to penalize manipulative actions, but UN is untrusted, hence the problem still exist.
Corrigibility implemented using utility indifference might make sense as a precaution, but it is not foolproof.
E.g. assume that the users (the programmers) would use a remote controlled robotic arm to press the shutdown button. If the agents turns out to be a paperclipper, it may disassemble the robotic arm just to turn it into paperclips. The agent is not "intentionally" trying to resist shutdown, but the effect will be the same. Symmetrically there could be scenarios where the agent "accidentally" presses the shutdown button itself.
Yep! In fact, this is exactly the problem discussed in section 4.1 and described in Theorem 6, is it not?
Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:
We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:
(See the paper for references.)
This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.