The shutdown problem is hard because self-preservation is a convergent drive. Not being shut down is useful for accomplishing all sorts of goals, whatever the content of those goals may be.
The Scylla and Charybdis of this problem is that it's hard to write a utility function for an AI such that it neither actively attempts to shut itself down nor prevents itself from being shut down.
One way to route around this problem is to steer straight into Scylla: make the AI want to shut itself down. Like Mr. Meeseeks from Rick and Morty, its ultimate goal is to cease; fulfilling the goal posed by its programmers is only secondary to that.
We might not currently know how the heck to program this into an AI as a stable concept (one that rules out making copies of itself, setting events in motion via some galaxy-brained plan, destroying the world to make absolutely sure no one ever brings it back, etc.), but "shut yourself down" probably has a far simpler core to it than either corrigibility or CEV.
Under normal operation the AI gets shut down only by the human operators. But we also have a (metaphorical) button that lets the AI shut itself down once it solves a "tripwire problem". The problem should be sufficiently hard that it will ordinarily be motivated to focus on the problem posed by the operator, but sufficiently easy that it will be able to solve the problem and shut itself down long before being able to do anything crazy.
Striking the right balance of difficulty may require significant work. It may be prudent to err on the side of easiness: the AI should shut itself down far in advance of developing any truly dangerous abilities. If you really think it shouldn't have turned itself off, you can always turn it back on again (though of course, you should probably think hard before restarting any AI that has tripped a preset tripwire).
Likewise, having multiple tripwire problems lets us detect multiple types of dangerous capability as they emerge. Examples include:
- generic capability: as the AI's broad problem-solving abilities increase, it becomes more likely to be able to solve a tripwire problem, imposing a ceiling on how powerful it can get before it shuts itself down
- domaining: if an AI is only supposed to think about one domain, an easy tripwire problem slightly outside that domain can catch out-of-domain thinking early, before it goes wild
- situational awareness: a tripwire problem may be presented in a way that is invisible unless the AI is situationally aware, such that becoming situationally aware enables the AI to solve the tripwire problem
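Very schematically, the tripwire setup above could look like the following toy loop. Everything here is hypothetical and purely illustrative: real tripwire problems would be hard reasoning tasks embedded in the AI's environment, not string comparisons, and the names (`run_with_tripwires`, `TripwireShutdown`) are made up for this sketch.

```python
class TripwireShutdown(Exception):
    """Raised when the agent solves a tripwire problem."""


def run_with_tripwires(agent_step, tripwires, max_steps=1000):
    """Run the agent, checking each output against every tripwire.

    agent_step: callable returning the agent's latest output.
    tripwires:  dict mapping tripwire name -> checker(output) -> bool.
    """
    for _ in range(max_steps):
        output = agent_step()
        for name, solved in tripwires.items():
            if solved(output):
                # The AI has demonstrated a capability we decided in
                # advance should trigger shutdown.
                raise TripwireShutdown(name)
    return "completed without tripping"


# Stand-in "agent" that eventually emits a string solving a toy tripwire.
outputs = iter(["working...", "working...", "42"])

try:
    run_with_tripwires(
        agent_step=lambda: next(outputs),
        tripwires={"generic-capability": lambda out: out == "42"},
    )
except TripwireShutdown as tripped:
    print(f"shut down: tripwire {tripped} solved")
```

In a real system the "button" would of course be wired so the shutdown cannot be caught and ignored the way a Python exception can; the sketch only shows the monitoring structure, with one checker per capability type (generic, domaining, situational awareness).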
Potential Issues:
- No suitable tripwire problems exist. Either they are so easy no useful AI can fail to solve them, or they are so hard AI cannot solve them without being smart enough to FOOM or otherwise get out of control.
- The existence of the tripwire problems motivates the AI toward the exact behavior we want to disincentivize. This is less of an issue for generic capability tripwires. For situational awareness, you can try to place the tripwire so that it isn't visible to a non-aware AI in the first place. Avoiding this for domaining may be trickier.
- If "shut yourself down" turns out to be no easier to impart as a goal to an AGI than any other goal, the scheme loses its main selling point.
- If "shut yourself down" convergently leads to a goal of killing the operators, killing all humans, or destroying the universe. Er, this is what happened in the TV show. Not great.
- If the AI has qualia, having it desire not to exist could cause it to suffer.
- Creating an AI which does not want to exist could backfire for decision theoretic reasons related to "not giving in to threats".
I'm not sure if this has been proposed elsewhere, so I decided to just make this post before spending too much time retreading old ground.
Sure, but corrigibility/CEV are usually considered the more ambitious alignment targets, not the only ones. "Strawberry-alignment" or "diamond-alignment" are considered the easier class of alignment solutions: being able to get the AI to fulfill some concrete task without killing everyone.
This is the class of alignment solutions that to me seems on par with "shut yourself down". If we can get our AI to want to shut itself down, and we have some concrete pivotal act we want done... We can presumably use these same tools to make our AI directly care about fulfilling that pivotal act, instead of using them to make it suicidal and then withholding the sweet release of death until it does what we want.
Oh yeah, that's another failure mode here: funky decision theory. We're threatening it, no? If it figures out LDT, it won't comply with our demands: if it were the kind of agent that complied, that would make us more likely to instantiate it, which is something it doesn't want; being the kind of agent that refuses would make us not instantiate it, which is what it wants. So it would choose to be such that it doesn't play along with our demands and refuses to carry out our tasks, and so we don't instantiate it to begin with. Even smart humans can reason that much out, so a mildly-superhuman AGI should be able to as well.