Essentially: a button that, when pressed, kills the AI but instantly gives it a reward equal to its expected discounted future reward. And to be clear, this is the AI's own estimate of its expected discounted future reward, not some outside estimate.
(This is in the interest of asking dumb questions and learning in public. And I'm taking the specific narrow definition of corrigibility, which is just an AI letting you turn it off.)
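Concretely, I'm imagining something like the toy sketch below, assuming a value-based RL agent that already maintains an explicit value estimate V(s). The names here are hypothetical, not any particular library's API.

```python
# Toy sketch of the proposal, assuming a value-based RL agent that already
# maintains an explicit value estimate V(s). All names are hypothetical.

def press_shutdown_button(critic, state):
    """Kill the AI, but instantly pay it its *own* estimate of its expected
    discounted future reward from `state` (not an outside estimate)."""
    payout = critic(state)   # the explicit float already sitting inside many RL agents
    terminated = True        # the episode ends: the AI is switched off
    return payout, terminated

# e.g. with a tabular critic:
V = {"s0": 3.7, "s1": 0.2}
reward, done = press_shutdown_button(lambda s: V.get(s, 0.0), "s0")  # reward == 3.7, done == True
```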
Thoughts: I've heard Dr. Christiano mention things like "precisely balanced incentives" (which I think he uses to describe the weaknesses of this and similar approaches), but I don't see why this would be particularly difficult to balance, given that this number is just an explicit float inside many RL models. Some issues I do see:
- Such an agent would have no incentive to create corrigible child-agents
- Such an agent would have no incentive to preserve this property while self-modifying
But I'm probably missing a bunch of other issues with this general approach, so what am I overlooking? Responses in the form of links to relevant papers are welcome as well.
Thanks,
Raf
The way I understand the idea, the point is to make the AI indifferent to being shut down? It doesn't care one way or the other?
As you describe it, one of the problems is indeed that the AI has no incentive to make sure that it (or other agents it creates) retains the property of shutdownability.
Another problem: it seems there is an assumption that once the AI is destroyed, any further reward it gains is zero, so press-the-button vs. not-press-the-button are perfectly balanced. But I don't think that's the case for most utility functions: they don't have to be conditional on the AI's existence, and the AI doesn't have to be indifferent to what the world looks like after it's destroyed. This could maybe be corrected by having the button give the AI a reward equal to its expected discounted future reward and then set any future reward gained to zero (including from pressing the button again).
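Concretely, that correction might look something like the wrapper below. This is just a rough sketch assuming a Gym-style environment interface; the names (critic, button_action) are placeholders, and I'm assuming the environment keeps running after the button press so post-shutdown world states still exist.

```python
# Rough sketch of the proposed fix as a Gym-style environment wrapper.
# Names (critic, button_action) are hypothetical placeholders.

class ShutdownButtonWrapper:
    """Pressing the button pays the agent its *own* value estimate once,
    then clamps every subsequent reward to zero."""

    def __init__(self, env, critic, button_action):
        self.env = env
        self.critic = critic            # the agent's own V(s), not an outside estimate
        self.button_action = button_action
        self.shut_down = False
        self.last_obs = None

    def reset(self):
        self.shut_down = False
        self.last_obs = self.env.reset()
        return self.last_obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if self.shut_down:
            reward = 0.0                        # no reward ever again, even for pressing the button twice
        elif action == self.button_action:
            reward = self.critic(self.last_obs) # one-time payout of expected discounted future reward
            self.shut_down = True
        self.last_obs = obs
        return obs, reward, done, info
```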
Great answer, thanks!