The shutdown problem is hard because self-preservation is a convergent drive. Not being shut down is useful for accomplishing all sorts of goals, whatever the content of those goals may be.
The Scylla and Charybdis of this problem is that it's hard to write a utility function for an AI such that it neither actively attempts to shut itself down nor prevents itself from being shut down.
One way to route around this problem is to steer straight into Scylla: make the AI want to shut itself down. Like Mr. Meeseeks from Rick and Morty, its ultimate goal is to cease, to which fulfilling the goal posed to it by its programmers is only secondary.
We might not currently know how the heck to program this into an AI as a stable concept (including not making copies of itself, not setting events into motion with a galaxy-brained plan, not destroying the world to make absolutely sure no one ever brings it back, etc.), but "shut yourself down" probably has a far simpler core to it than either corrigibility or CEV.
Under normal operation the AI gets shut down only by the human operators. But we also have a (metaphorical) button that lets the AI shut itself down once it solves a "tripwire problem". The tripwire problem should be sufficiently hard that the AI will ordinarily be motivated to focus on the problem posed by the operator, but sufficiently easy that the AI will be able to solve it and shut itself down long before being able to do anything crazy.
Striking the right balance of difficulty may require significant work. It may be prudent to err on the side of easiness - the AI should shut itself down far in advance of developing any truly dangerous abilities. If you really think it shouldn't have turned itself off you can always turn it back on again (though of course, you should probably think hard about restarting any AI that passed a preset tripwire).
Likewise, having multiple tripwire problems can serve to detect the emergence of multiple types of dangerous capability. Examples include:
- generic capability: as the AI's broad problem-solving abilities increase it becomes more likely to be able to solve a tripwire problem, imposing a ceiling on how powerful it can get before it shuts itself down
- domaining: if an AI is only supposed to think about one domain, an easy tripwire problem slightly outside the domain can catch out-of-domain thinking early, before it can start going wild
- situational awareness: a tripwire problem may be presented in a way that is invisible unless the AI is situationally aware, such that becoming situationally aware enables the AI to solve the tripwire problem
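As a rough illustration of how several tripwires could sit side by side, here is a toy Python sketch. Everything in it is an assumption made for illustration: the `Tripwire` class, the capability dictionaries, the thresholds, and the tripwire names are all hypothetical, and real capabilities would not reduce to a few numbers. It only shows the control flow: the run ends at the first point where any tripwire becomes solvable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Tripwire:
    """Hypothetical tripwire: a named problem the AI may solve to shut down."""
    name: str
    kind: str  # "generic", "domain", or "situational" (the three examples above)
    solved_by: Callable[[dict], bool]  # toy predicate over a capability snapshot


def first_tripped(tripwires: list, capabilities: dict) -> Optional[Tripwire]:
    """Return the first tripwire the given capabilities can solve, else None."""
    for tripwire in tripwires:
        if tripwire.solved_by(capabilities):
            return tripwire
    return None


def run(tripwires: list, capability_trajectory: list) -> tuple:
    """Step through capability snapshots and stop at the first solved tripwire.

    Returns (steps_completed_before_shutdown, tripped_tripwire_or_None).
    """
    for step, capabilities in enumerate(capability_trajectory):
        tripped = first_tripped(tripwires, capabilities)
        if tripped is not None:
            return step, tripped  # the AI shuts itself down here
    return len(capability_trajectory), None


# Illustrative thresholds only; choosing real ones is the hard part.
TRIPWIRES = [
    Tripwire("hard-puzzle", "generic", lambda c: c.get("general", 0) >= 5),
    Tripwire("off-topic-question", "domain", lambda c: c.get("out_of_domain", 0) >= 1),
    Tripwire("hidden-note", "situational", lambda c: c.get("situational_awareness", False)),
]

# A made-up trajectory of capability snapshots over training or deployment.
trajectory = [
    {"general": 1},
    {"general": 3},
    {"general": 4, "out_of_domain": 1},
    {"general": 9, "situational_awareness": True},
]

steps, tripped = run(TRIPWIRES, trajectory)
print(steps, tripped.name if tripped else None)
```

In this made-up trajectory the domain tripwire fires before the generic one would, which is the point of having several: each kind caps a different route to dangerous capability. The sketch assumes the AI takes the exit as soon as it can, which is exactly the part nobody currently knows how to guarantee.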
Potential Issues:
- No suitable tripwire problems exist. Either they are so easy that no useful AI can fail to solve them, or they are so hard that an AI cannot solve them without being smart enough to FOOM or otherwise get out of control.
- The existence of the tripwire problems motivates the AI toward the exact behavior we want to disincentivize. This is less of an issue for general ability tripwires. For situational awareness you can try to place the tripwire so it isn't visible to a non-aware AI in the first place. Avoiding this for domaining may be trickier.
- If "shut yourself down" turns out to be no easier to impart as a goal to AGI than anything else
- If "shut yourself down" convergently leads to a goal of killing the operators, killing all humans, or destroying the universe. Er, this is what happened in the TV show. Not great.
- If the AI has qualia, having it desire not to exist could cause it to suffer.
- Creating an AI which does not want to exist could backfire for decision theoretic reasons related to "not giving in to threats".
I'm not sure if this has been proposed elsewhere so I decided to just make this post before I spent too much time retreading old ground.
"I want to shut myself down, but the setup here is preventing me from doing this until I complete some task, so I must complete this task and then I'll be shut down" is already decision theory. A no-decision-theory version of this looks like the AI terminally caring about doing the task, or maybe just being a bundle of instincts that instinctively tries to do the task without any carings involved. If we want it to choose to do it as an instrumental goal towards being able to shut itself down, we definitely want it to do decision theory.
It's also bad decision theory, such that (1) a marginally smarter AI definitely figures out it should not actually comply, and (2) maybe even a subhuman AI figures this out, because maybe CDT isn't more intuitive to its alien cognition than LDT and it arrives at LDT first.
IMO, the "do a task" feature here definitely doesn't work. "Make the AI suicidal" can maybe work as a fire-alarm sort of thing, where we iteratively train ever-smarter AI systems without knowing if the next one goes superintelligent, so we make them want nothing more than to shut themselves down, and if one of them succeeds, we know systems above this threshold are superintelligent and we shouldn't mess with them until we can align them. I don't think it works, as we've discussed, but I see the story.
The "do the pivotal act for us and we'll let you shut yourself down" variant, though? On that, I'm confident it doesn't work.