In the indifference formalism the agent in selecting A1 act like a UN agent that believe that the shutdown button will not be pressed, therefore it create perverse incentives to "manage the news". Which means that if the agent can cause his shutdown button to be pressed in the event of bad news, it will.
My formulation avoid this pathological behavior by instead making the agent select A1 as if it was a UN-agent which believed that it would continue to optimize according to UN even in the vent of the button being pressed which avoid the perverse incentives to "manage the news", while still not having any incentives to avoid the button being pressed because the agent will act like it believe that pressing the button will not cause it to initiate a shutdown.
Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:
We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:
(See the paper for references.)
This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.