Essentially: a button that, when you press it, kills the AI but instantly gives it a reward equal to its expected discounted future reward. And to be clear, this is the AI's own estimate of its expected discounted future reward, not some outside estimate.
(This is in the spirit of asking dumb questions and learning in public. I'm also using the specific narrow definition of corrigibility, which is just an AI letting you turn it off.)
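For concreteness, here's a minimal sketch of the mechanism I'm imagining (the `ShutdownButtonWrapper` and `value_estimate` names are placeholders I made up, not any real RL library API):

```python
# Sketch only: a wrapper that, when the shutdown button is pressed, ends the
# episode and pays out the agent's OWN estimate of its expected discounted
# future reward (not an outside estimate). All names here are placeholders.

class ShutdownButtonWrapper:
    def __init__(self, env, agent):
        self.env = env      # anything with a step(action) -> (state, reward, done)
        self.agent = agent  # anything with a value_estimate(state) -> float

    def step(self, state, action, button_pressed=False):
        if button_pressed:
            # The AI is killed (episode terminates) but is immediately paid
            # its own predicted discounted future reward, V(state).
            terminal_reward = self.agent.value_estimate(state)
            return state, terminal_reward, True
        return self.env.step(action)
```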
Thoughts: I've heard Dr. Christiano mention things like "precisely balanced incentives" (which I think he uses to describe the weaknesses of this and similar approaches), but I don't see why this would be particularly difficult to balance, given that this number is just an explicit float inside many RL models. Some issues I do see:
- Such an agent would have no incentive to create corrigible child-agents
- Such an agent would have no incentive to preserve this property while self-modifying
But I'm probably missing a bunch of other issues with this general approach. What else should I be worried about? Responses in the form of links to relevant papers are welcome as well.
Thanks,
Raf
Won't the AI press the button itself?
If implemented as described, the AI should be exactly indifferent to pushing the button? I guess the AI's behavior in that situation is not well defined… and if we instead make the button give expected value minus epsilon reward, then the AI might kill you to stop you from pressing the button (because it doesn't want to lose that epsilon of reward!)
So overall I suppose this is a fair criticism of the approach, and possibly what Paul means by issues with precisely balanced incentives!
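To make the tie-breaking worry concrete, here's a toy comparison (the numbers are made up, purely for illustration):

```python
# Toy illustration of the tie-breaking problem; all numbers are invented.

V_continue = 10.0   # agent's own estimate of its discounted future reward
epsilon = 0.01

# Scheme 1: the button pays exactly V(s).
return_button = V_continue
print(return_button == V_continue)   # True -> a perfect tie, behavior undefined

# Scheme 2: the button pays V(s) - epsilon.
return_button = V_continue - epsilon
print(return_button < V_continue)    # True -> the agent strictly prefers that
                                      # the button never be pressed
```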