Essentially: a button that, when pressed, kills the AI but instantly gives it reward equal to its expected discounted future reward. And to be clear, this is the AI's own estimate of its expected discounted future reward, not some outside estimate.
(This is in the interest of asking dumb questions and learning in public. And I'm taking the specific narrow definition of corrigibility, which is just an AI letting you turn it off.)
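For concreteness, here's a toy sketch of the setup as I understand it (my own illustration; the states, value table, and numbers are all made up): the button terminates the episode with a payout equal to the agent's own value estimate, so a Bellman-consistent agent is exactly indifferent to whether the button gets pressed.

```python
# Toy sketch (my own illustration, not from any published design):
# pressing the button ends the episode with reward equal to the
# agent's OWN value estimate V[s], not an outside estimate.

GAMMA = 0.9  # discount factor

# Hypothetical learned value table: V[s] is the agent's estimate of
# its expected discounted future return from state s.
V = {"s0": 5.0, "s1": 2.0}

def shutdown_return(state):
    """Payout if the button is pressed in `state`: exactly the
    agent's own estimate of what it expected to earn anyway."""
    return V[state]

def continue_return(state, reward, next_state):
    """One-step lookahead return for continuing instead of
    being shut down."""
    return reward + GAMMA * V[next_state]

# If V is Bellman-consistent, the two options tie: e.g. from s0,
# getting reward 3.2 and landing in s1 gives 3.2 + 0.9 * 2.0 = 5.0,
# the same as shutdown_return("s0").
print(shutdown_return("s0"))
print(continue_return("s0", 3.2, "s1"))
```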
Thoughts: I've heard Dr. Christiano mention things like "precisely balanced incentives" (which I think he uses to describe the weaknesses of this and similar approaches), but I don't see why this would be particularly difficult to balance, given that this number is just an explicit float inside many RL models. Some issues I do see:
- Such an agent would have no incentive to create corrigible child-agents
- Such an agent would have no incentive to preserve this property while self-modifying
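On the "precisely balanced incentives" point, here's a toy sketch (again my own, with made-up numbers) of why the balance is a knife-edge: indifference holds only when the payout exactly equals the agent's current estimate. If the estimate has drifted since the payout was fixed, or the payout is computed with any error, the agent acquires an active preference about the button.

```python
# Knife-edge sketch (my own toy illustration): any gap between the
# payout and the agent's current value estimate breaks indifference.

def button_preference(v_estimate, payout):
    """Sign of the agent's incentive regarding the button:
    +1 it wants the button pressed, -1 it wants to prevent
    shutdown, 0 exact indifference."""
    diff = payout - v_estimate
    return 0 if diff == 0 else (1 if diff > 0 else -1)

print(button_preference(5.0, 5.0))   # indifferent
print(button_preference(5.0, 5.01))  # seeks shutdown
print(button_preference(5.0, 4.99))  # resists shutdown
```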
But what other issues am I missing with this general approach? Responses in the form of links to relevant papers are welcome as well.
Thanks,
Raf
The point about agents in the environment suggests that shutdown is not really about corrigibility; it's more of a test case (application) for corrigibility. If an agent can trivially create other agents in the environment, and poses a risk of actually doing so, then shutting it down doesn't resolve that risk, so you'd need to take care of the more general problem first.

Not creating agents in the environment seems closer to the soft-optimization side of corrigibility: preferring less dangerous cognition, not being an optimizer. The agent not contesting or assisting shutdown is still useful when the risk isn't there or doesn't actually trigger, but it's not necessarily related to corrigibility, other than in the sense of being an important task for corrigible agents to support.