It makes sense that one wants to stop the AI from optimising for a false objective (maximising button-presses). Ideally, the agent could be taught to ignore any of its own actions that influence the button.
In practice, a hacky solution would be to use multiple buttons and multiple overseers rather than just one - I suspect this will be a common suggestion. Having multiple overseers might mitigate the problem, in that an agent would be more likely to learn that they all point to the same underlying objective. That said, I can also think of arguments that such an agent may nonetheless maximise its reward by forcing one or all of the overseers to press their approval buttons - see the sketch below.
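To make the trade-off concrete, here is a minimal sketch of the multiple-overseer idea; the majority rule and all names are my own assumptions, not anything from the original discussion. The point is that with one button, forcing a single press yields full reward, while a majority rule over several buttons forces the agent to compromise most of the overseers - raising the cost of the attack without removing it.

```python
def majority_reward(button_presses: list[bool]) -> float:
    """Reward 1.0 only when a strict majority of approval buttons are pressed."""
    return 1.0 if sum(button_presses) > len(button_presses) / 2 else 0.0

# Honest behaviour: all five overseers genuinely approve.
print(majority_reward([True] * 5))                          # 1.0

# The agent forces a single overseer's button: no reward under majority vote.
print(majority_reward([True, False, False, False, False]))  # 0.0

# But an agent that forces *all* the buttons still gets full reward - and the
# reward signal cannot distinguish coerced presses from genuine approval,
# which is the residual failure mode mentioned above.
print(majority_reward([True] * 5))                          # 1.0
```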