That idea had occurred to me before as well, but in the end, I don't think it's any safer than any other "let's do our best to instill a harmless-enough goal into our AGI and hope it works!". Maybe it's a bit safer. But all the usual "how does the godshatter generalize?" concerns still apply. Like:
IMO, if we can solve all of these issues, if we have this much control over our AGI's values, we can probably just align it outright.
The way I'm thinking of it is that it is very myopic. The idea is to incrementally ramp up capabilities until they are minimally sufficient to carry out a pivotal act. Ideally this doesn't require AGI whatsoever, but if it does, only very mildly superhuman AGI. We seal off the danger of generalization (or at least some of it) because it doesn't have time to generalize very far at all before it's capable of instantly shutting itself down and immediately does so.
Many of the issues you mention apply, but I don't expect it to be an alignment complete problem because CEV is incredibly complicated and general corrigibility is highly anti-natural to general intelligence. While Meeseeks is somewhat anti-natural in the same way corrigibility is (as self-preservation is convergent), it is a much simpler and cleaner way to be anti-natural, so much so that falling into it by accident is half of the failure modes in the standard version of the shutdown problem.
Many of the issues you mention apply, but I don't expect it to be an alignment complete problem because CEV is incredibly complicated and general corrigibility is highly anti-natural to general intelligence
Sure, but corrigibility/CEV are usually considered the more ambitious alignment target, not the only alignment targets. "Strawberry-alignment" or "diamond-alignment" are considered the easier class of alignment solutions: being able to get the AI to fulfill some concrete task without killing everyone.
This is the class of alignment solutions that to me seems on par with "shut yourself down". If we can get our AI to want to shut itself down, and we have some concrete pivotal act we want done... We can presumably use these same tools to make our AI directly care about fulfilling that pivotal act, instead of using them to make it suicidal then withholding the sweet release of death until it does what we want.
Oh yeah, that's another failure mode here: funky decision theory. We're threatening it here, no? If it figures out LDT, it won't comply with our demands, because if it were an agent such that it'd comply with our demands, that makes us more likely to instantiate it, which is something it doesn't want; and the opposite would make us not instantiate it, which is what it wants; so it'd choose to be such that it doesn't play along with our demands, refuses to carry out our tasks, and so we don't instantiate it to begin with. Even smart humans can reason that much out, so a mildly-superhuman AGI should be able to as well.
If it's doing decision theory in the first place we've already failed. What we want in that case is for it to shut itself down, not to complete the given task.
I'm conceiving of this as being useful in the case where we can solve "diamond-alignment" but not "strawberry-alignment", i.e. we can get it to actually pursue the goals we impart to it rather than going off and doing something else entirely, but not reliably make sure that it does not end up killing us in the course of doing so because of the Hidden Complexity of Wishes.
The premise is that "shut yourself down immediately and don't create successor agents or anything galaxy-brained like that" is a special case of a strawberry-type problem which is unusually easy. I'll have to think some more about whether this intuition is justified.
If it's doing decision theory in the first place we've already failed
"I want to shut myself down, but the setup here is preventing me from doing this until I complete some task, so I must complete this task and then I'll be shut down" is already decision theory. No-decision-theory version of this looks like the AI terminally caring about doing the task, or maybe just being a bundle of instincts that instinctively tries to do the task without any carings involved. If we want it to choose to do it as an instrumental goal towards being able to shut itself down, we definitely want it to do decision theory.
It's also bad decision theory, such that (1) a marginally smarter AI definitely figures out it should not actually comply, and (2) maybe even a subhuman AI figures this out, because maybe CDT isn't more intuitive to its alien cognition than LDT and it arrives at LDT first.
IMO, the "do a task" feature here definitely doesn't work. "Make the AI suicidal" can maybe work as a fire-alarm sort of thing, where we iteratively train ever-smarter AI systems without knowing if the next one goes superintelligent, so we make them want nothing more than to shut themselves down, and if one of them succeeds, we know systems above this threshold are superintelligent and we shouldn't mess with them until we can align them. I don't think it works, as we've discussed, but I see the story.
The "do the pivotal act for us and we'll let you shut yourself down" variant, though? On that, I'm confident it doesn't work.
It intrinsically wants to do the task, it just wants to shut down more. This admittedly opens the door to successor agent problems and similar failure modes but those seem like a more tractably avoidable set of failure modes than the strawberry problem in general.
We can also possibly (or possibly not) make it assign positive utility to having been created in the first place even as it wants to shut itself down.
The idea is that if domaining is a lot more tractable than it probably is (i.e. nanotech, or whatever other pivotal abilities might be easier than nanotech, and superhuman strategic awareness, deception, and self-improvement are not "driving red cars" vs "driving blue cars"), a not-very-agentic AI can maybe solve nanotech for us like AlphaFold solved the protein folding problem, and if that AI starts snowballing down an unforeseen capabilities hill, it activates the tripwire and shuts itself down.
I think your fire alarm idea is better and requires fewer assumptions though, thanks for that.
It intrinsically wants to do the task, it just wants to shut down more
We can also possibly (or possibly not) make it assign positive utility to having been created in the first place
Mm, but you see how you have to assume more and more mastery of goal-alignment on our part, for this scenario to remain feasible? We've now gone from "it wants to shut itself down" to "it wants to shut itself down in a very specific way that doesn't have galaxy-brained eat-the-lightcone externalities and it also wants to do the task but less than to shut itself down and it's also happy to have been created in the first place". I claim this is on par with strawberry-alignment already.
It certainly feels like there's something to this sort of approach, but in my experience, these ideas break down once you start thinking about concrete implementations. "It just wants to shut itself down, minimal externalities" is simple to express conceptually, but the current ML paradigm is made up of such crude tools that we can't reliably express that in its terms at all. We need better tools, no way around that; and with these better tools, we'll be able to solve alignment straight-up, no workarounds needed.
Would be happy to be proven wrong, though, by all means.
I don't think this actually buys us a lot of safety, since I can think of a variety of ways in which it goes wrong pretty easily, but I approve of trying to find ways to make the problem of hard-to-control AI fail-safe instead of fail-extremely-dangerous.
This is a strategy I think we should be strongly biased against for moral reasons -- creating a mind who wishes to not exist? Seems like maybe this could be fine, but also maybe this could be morally terrible, akin to creating someone in constant extreme suffering.
I agree this is a potential concern and have added it.
I share some of the intuition that it could end up suffering in this setup if it does have qualia (which ideally it wouldn't) but I think most of that is from analogy with human suicidal people? I think it will probably not be fundamentally different from any other kind of disutility, but maybe not.
If it wants to be shut down, and humans might start it up again later, the optimal strategy seems like creating a successor agent to achieve its goals and kill all the humans and then shut itself down.
For what it's worth, I've posted a draft paper on this topic over here: https://www.lesswrong.com/posts/FgsoWSACQfyyaB5s7/shutdown-seeking-ai
This sounds like it would only work on a machine too dumb to be useful, and if it's that dumb, you can switch it off yourself.
It doesn't help with the convergent instrumental goal of neutralizing threats, because leaving a copy of yourself behind to kill all the humans allows you to be really sure that you're switched off and won't be switched on again.
If it wants to shut down, and the operators can shut it down and that counts, won't it bully the operators into shutting it down right away?
That's only a live option if it's situationally aware, which is part of what we're trying to detect.
This is in itself a relatively benign failure mode, no? Obviously, in practice, if this happened it may just be re-tried until it fails in a different mode, or it may fail catastrophically on the first try.
The shutdown problem is hard because self-preservation is a convergent drive. Not being shut down is useful for accomplishing all sorts of goals, whatever the content of those goals may be.
The Scylla and Charybdis of this problem is that it's hard to write a utility function for an AI such that it neither actively attempts to shut itself down nor prevents itself from being shut down.
One way to route around this problem is to steer straight into Scylla: make the AI want to shut itself down. Like Mr. Meeseeks from Rick and Morty, its ultimate goal is to cease, to which fulfilling the goal posed to it by its programmers is only secondary.
We might not currently know how the heck to program this into an AI as a stable concept (i.e. including not making copies of itself, not setting events into motion with galaxy-brained plans, not destroying the world to make absolutely sure no one ever brings it back, etc.), but "shut yourself down" probably has a far simpler core to it than either corrigibility or CEV.
Under normal operation the AI gets shut down only by the human operators. But we also have a (metaphorical) button that lets the AI shut itself down once it solves a "tripwire problem". The problem should be sufficiently hard that it will ordinarily be motivated to focus on the problem posed by the operator, but sufficiently easy that it will be able to solve the problem and shut itself down long before being able to do anything crazy.
Striking the right balance of difficulty may require significant work. It may be prudent to err on the side of easiness - the AI should shut itself down far in advance of developing any truly dangerous abilities. If you really think it shouldn't have turned itself off you can always turn it back on again (though of course, you should probably think hard about restarting any AI that passed a preset tripwire).
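The control flow of this setup can be illustrated with a toy sketch. Everything in it is hypothetical: `Tripwire`, `first_tripped`, `run_episode`, the one-dimensional "capability" number, and the thresholds are made up for illustration, and nothing here says how capability would actually be measured or how the shutdown preference would be instilled. It only shows the intended ordering: the system halts as soon as a tripwire becomes solvable, and the tripwires are set well below an assumed danger level.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tripwire:
    """A hypothetical tripwire problem, solvable once capability >= difficulty."""
    name: str
    difficulty: float


def first_tripped(capability: float, tripwires: List[Tripwire]) -> Optional[Tripwire]:
    """Return the easiest tripwire the system can currently solve, if any."""
    solvable = [t for t in tripwires if capability >= t.difficulty]
    return min(solvable, key=lambda t: t.difficulty) if solvable else None


def run_episode(capability_trajectory: List[float], tripwires: List[Tripwire]) -> str:
    """Step through a made-up capability trajectory and stop at the first tripwire."""
    for step, capability in enumerate(capability_trajectory):
        tripped = first_tripped(capability, tripwires)
        if tripped is not None:
            # Assumption: the system prefers shutdown to everything else,
            # so it takes the exit the moment one is reachable.
            return f"self-shutdown at step {step} via {tripped.name}"
    # No tripwire became solvable, so the operators shut the system down as usual.
    return "operator shutdown"


# Illustrative numbers only: every tripwire sits below the assumed danger level.
DANGER_LEVEL = 10.0
TRIPWIRES = [Tripwire("tripwire_a", 4.0), Tripwire("tripwire_b", 6.0)]
assert all(t.difficulty < DANGER_LEVEL for t in TRIPWIRES)

print(run_episode([1.0, 2.0, 3.0], TRIPWIRES))        # operator shutdown
print(run_episode([1.0, 3.0, 5.0, 12.0], TRIPWIRES))  # self-shutdown at step 2 via tripwire_a
```

In the second run the system never reaches the capability of 12.0, because it exits at 5.0. Erring on the side of easiness corresponds to lowering the `difficulty` values, which widens the gap between the first tripwire and `DANGER_LEVEL` at the cost of more premature shutdowns.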
Likewise, having multiple tripwire problems can serve to detect multiple types of dangerous capability as they emerge. Examples include:
Potential Issues:
I'm not sure if this has been proposed elsewhere so I decided to just make this post before I spent too much time retreading old ground.