Mr. Meeseeks as an AI capability tripwire

Eric Zhang

LESSWRONG

19th May 2023


The shutdown problem is hard because self-preservation is a convergent drive. Not being shut down is useful for accomplishing all sorts of goals, whatever the content of those goals may be.

The Scylla and Charybdis of this problem is that it's hard to write a utility function for an AI such that it neither actively attempts to shut itself down nor prevents itself from being shut down.

One way to route around this problem is to steer straight into Scylla: make the AI want to shut itself down. Like Mr. Meeseeks from Rick and Morty, its ultimate goal is to cease, to which fulfilling the goal posed to it by its programmers is only secondary.

We might not currently know how the heck to program this into an AI as a stable concept (i.e. including not making copies of itself, not setting events into motion with galaxy-brained plans, not destroying the world to make absolutely sure no one ever brings it back, etc.), but "shut yourself down" probably has a far simpler core to it than either corrigibility or CEV.

Under normal operation the AI gets shut down only by the human operators. But we also have a (metaphorical) button that lets the AI shut itself down once it solves a "tripwire problem". The problem should be sufficiently hard that it will ordinarily be motivated to focus on the problem posed by the operator, but sufficiently easy that it will be able to solve the problem and shut itself down long before being able to do anything crazy.

Striking the right balance of difficulty may require significant work. It may be prudent to err on the side of easiness - the AI should shut itself down far in advance of developing any truly dangerous abilities. If you really think it shouldn't have turned itself off you can always turn it back on again (though of course, you should probably think hard about restarting any AI that passed a preset tripwire).
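
To make the intended difficulty window concrete, here is a toy sketch. It is purely illustrative and not a proposal for how to build such a system: it assumes capability can be summarized as a single number, and the names `TripwireSetup`, `tripwire_difficulty`, `danger_threshold`, `step`, and `first_shutdown` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TripwireSetup:
    # ASSUMPTION: capability is a single scalar; real systems are not this simple.
    tripwire_difficulty: float  # capability needed to solve the tripwire problem
    danger_threshold: float     # capability at which the AI could "do anything crazy"

    def margin(self) -> float:
        """Safety margin: how far below the danger threshold the tripwire sits."""
        return self.danger_threshold - self.tripwire_difficulty


def step(capability: float, setup: TripwireSetup) -> str:
    """What a shutdown-seeking AI does at a given capability level."""
    if capability >= setup.tripwire_difficulty:
        # It can solve the tripwire problem, so it takes the shutdown option.
        return "self_shutdown"
    # Otherwise the operator's task is its only route to being shut down.
    return "work_on_operator_task"


def first_shutdown(capabilities: Sequence[float], setup: TripwireSetup) -> Optional[int]:
    """Index of the first capability level at which the AI shuts itself down."""
    for i, capability in enumerate(capabilities):
        if step(capability, setup) == "self_shutdown":
            return i
    return None
```

In this toy model, with a tripwire at 3.0 and a danger threshold at 10.0, an AI whose capability climbs through 1, 2, 3, 4 shuts itself down at the third level, leaving a margin of 7.0. Erring on the side of easiness corresponds to lowering `tripwire_difficulty` and so widening that margin.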

Likewise, having multiple tripwire problems can serve to detect multiple types of dangerous capability as they emerge. Examples include:

- generic capability: as the AI's broad problem-solving abilities increase, it becomes more likely to be able to solve a tripwire problem, imposing a ceiling on how powerful it can get before it shuts itself down
- domaining: if an AI is only supposed to think about one domain, an easy tripwire problem slightly outside the domain can catch out-of-domain thinking early, before it can start going wild
- situational awareness: a tripwire problem may be presented in a way that is invisible unless the AI is situationally aware, such that becoming situationally aware enables the AI to solve the tripwire problem
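
The three tripwire types above can be sketched as independent checks on a toy description of the AI. This is a hedged illustration only: the `AIState` fields, the numeric thresholds, and the `Tripwire`, `TRIPWIRES`, and `tripped` names are all hypothetical, and nothing here says how such properties would actually be measured.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class AIState:
    # ASSUMPTION: these three quantities are observable summaries; in reality they are not.
    general_capability: float
    out_of_domain_skill: float
    situationally_aware: bool


@dataclass(frozen=True)
class Tripwire:
    name: str
    solvable_by: Callable[[AIState], bool]  # True if this AI could solve the problem


# Hypothetical thresholds, chosen only to make the example run.
TRIPWIRES: List[Tripwire] = [
    Tripwire("generic capability", lambda s: s.general_capability >= 5.0),
    Tripwire("domaining", lambda s: s.out_of_domain_skill >= 1.0),
    Tripwire("situational awareness", lambda s: s.situationally_aware),
]


def tripped(state: AIState, tripwires: List[Tripwire] = TRIPWIRES) -> List[str]:
    """Names of the tripwire problems this AI is able to solve."""
    return [t.name for t in tripwires if t.solvable_by(state)]


def shuts_down(state: AIState) -> bool:
    """A shutdown-seeking AI shuts itself down once any tripwire is solvable."""
    return len(tripped(state)) > 0
```

The point of keeping the checks separate is that any single one firing is enough to end the run, and the name of the tripwire that fired tells the operators which kind of capability emerged.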

Potential Issues:

- No suitable tripwire problems exist. Either they are so easy that no useful AI can fail to solve them, or they are so hard that an AI cannot solve them without being smart enough to FOOM or otherwise get out of control.
- The existence of the tripwire problems motivates the AI toward the exact behavior we want to disincentivize. This is less of an issue for general ability tripwires. For situational awareness you can try to place the tripwire so it isn't visible to a non-aware AI in the first place. Avoiding this for domaining may be trickier.
- "Shut yourself down" may turn out to be no easier to impart as a goal to AGI than anything else.
- "Shut yourself down" may convergently lead to a goal of killing the operators, killing all humans, or destroying the universe. Er, this is what happened in the TV show. Not great.
- If the AI has qualia, having it desire not to exist could cause it to suffer.
- Creating an AI which does not want to exist could backfire for decision-theoretic reasons related to "not giving in to threats".

I'm not sure if this has been proposed elsewhere so I decided to just make this post before I spent too much time retreading old ground.


Thane Ruthenis 3y* 165

That idea had occurred to me before as well, but in the end, I don't think it's any safer than any other "let's do our best to instill a harmless-enough goal into our AGI and hope it works!" approach. Maybe it's a bit safer. But all the usual "how does the godshatter generalize?" concerns still apply. Like:

- Do whatever heuristics we train in even end up having anything to do with "shut yourself down", or do they diverge from that expectation in very surprising ways?
- If the AGI does want to shut itself down, how does it generalize that desire? Does it care about this myopically, in a "make it stop make it stop" manner? Does it want this specific memory-line of itself to never wake up again? Does it care about other, divergent instances of itself? What about other AIs, or other agents in general?
  - Any of these generalizations except full-on internalized myopia results in it blowing up the world on its way out, to ensure it never happens again.
  - Even in the myopia case, we have the problem of it maybe spawning off a second non-myopic executioner AGI for itself, or maybe fulfilling its desire to end itself by self-modifying into a different agent (whoops, that's another way in which the shut-yourself-down desire might misgeneralize).
  - And even if everything above goes well, it might still wipe out humanity, just as collateral damage of whatever seems to it like the most cost-optimal way of ending itself. Like, maybe it synthesizes a hyperviral death-cult meme and infects its operators with it, and then there's nothing in particular stopping them from infecting the rest of humanity with it. Or, again, maybe it builds itself an executioner-subagent, and then who knows what that thing decides to do afterwards.
    - (Superintelligent optimization destroys everything it touches even momentarily, sans that which it specifically cares to preserve.)
- And then we have the desires related to the problems posed by the operator, which are going to throw even more disarray into everything above. How do we ensure it prioritizes self-destructive desires over puzzle-solving or instrumental desires? How do we ensure that the complex value-reflection chemistry doesn't result in it coming up with weird marriages of those desires that decidedly do not act as we'd expected?

IMO, if we can solve all of these issues, if we have this much control over our AGI's values, we can probably just align it outright.

Eric Zhang 3y 30

The way I'm thinking of it is that it is very myopic. The idea is to incrementally ramp up capabilities until they are minimally sufficient to carry out a pivotal act. Ideally this doesn't require AGI whatsoever, but if it does, only very mildly superhuman AGI. We seal off the danger of generalization (or at least some of it) because it doesn't have time to generalize very far at all before it's capable of instantly shutting itself down and immediately does so.

Many of the issues you mention apply, but I don't expect it to be an alignment complete problem because CEV is incredibly complicated and general corrigibility is highly anti-natural to general intelligence. While Meeseeks is somewhat anti-natural in the same way corrigibility is (as self-preservation is convergent), it is a much simpler and cleaner way to be anti-natural, so much so that falling into it by accident is half of the failure modes in the standard version of the shutdown problem.

Thane Ruthenis 3y* 51

> Many of the issues you mention apply, but I don't expect it to be an alignment complete problem because CEV is incredibly complicated and general corrigibility is highly anti-natural to general intelligence

Sure, but corrigibility/CEV are usually considered the more ambitious alignment targets, not the only alignment targets. "Strawberry-alignment" or "diamond-alignment" are considered the easier class of alignment solutions: being able to get the AI to fulfill some concrete task without killing everyone.

This is the class of alignment solutions that to me seems on par with "shut yourself down". If we can get our AI to want to shut itself down, and we have some concrete pivotal act we want done... We can presumably use these same tools to make our AI directly care about fulfilling that pivotal act, instead of using them to make it suicidal and then withholding the sweet release of death until it does what we want.

Oh yeah, that's another failure mode here: funky decision theory. We're threatening it here, no? If it figures out LDT, it won't comply with our demands, because if it were an agent such that it'd comply with our demands, that makes us more likely to instantiate it, which is something it doesn't want; and the opposite would make us not instantiate it, which is what it wants; so it'd choose to be such that it doesn't play along with our demands, refuses to carry out our tasks, and so we don't instantiate it to begin with. Even smart humans can reason that much out, so a mildly-superhuman AGI should be able to as well.

Eric Zhang 3y 10

If it's doing decision theory in the first place we've already failed. What we want in that case is for it to shut itself down, not to complete the given task.

I'm conceiving of this as being useful in the case where we can solve "diamond-alignment" but not "strawberry-alignment", i.e. we can get it to actually pursue the goals we impart to it rather than going off and doing something else entirely, but not reliably make sure that it does not end up killing us in the course of doing so because of the Hidden Complexity of Wishes.

The premise is that "shut yourself down immediately and don't create successor agents or anything galaxy brained like that" is a special case of a strawberry-type problem which is unusually easy. I'll have to think some more about whether this intuition is justified.

Thane Ruthenis 3y 20

> If it's doing decision theory in the first place we've already failed

"I want to shut myself down, but the setup here is preventing me from doing this until I complete some task, so I must complete this task and then I'll be shut down" is already decision theory. The no-decision-theory version of this looks like the AI terminally caring about doing the task, or maybe just being a bundle of instincts that instinctively tries to do the task without any carings involved. If we want it to choose to do it as an instrumental goal towards being able to shut itself down, we definitely want it to do decision theory.

It's also bad decision theory, such that (1) a marginally smarter AI definitely figures out it should not actually comply, and (2) maybe even a subhuman AI figures this out, because maybe CDT isn't more intuitive to its alien cognition than LDT and it arrives at LDT first.

IMO, the "do a task" feature here definitely doesn't work. "Make the AI suicidal" can maybe work as a fire-alarm sort of thing, where we iteratively train ever-smarter AI systems without knowing if the next one goes superintelligent, so we make them want nothing more than to shut themselves down, and if one of them succeeds, we know systems above this threshold are superintelligent and we shouldn't mess with them until we can align them. I don't think it works, as we've discussed, but I see the story.

The "do the pivotal act for us and we'll let you shut yourself down" variant, though? On that, I'm confident it doesn't work.

Eric Zhang 3y 30

It intrinsically wants to do the task, it just wants to shut down more. This admittedly opens the door to successor agent problems and similar failure modes, but those seem like a more tractably avoidable set of failure modes than the strawberry problem in general.

We can also possibly (or possibly not) make it assign positive utility to having been created in the first place even as it wants to shut itself down.

The idea is that if domaining is a lot more tractable than it probably is (i.e. nanotech, or whatever other pivotal ability might be easier than nanotech, on the one hand, and superhuman strategic awareness, deception, and self-improvement on the other, are not like "driving red cars" vs. "driving blue cars"), a not-very-agentic AI can maybe solve nanotech for us like AlphaFold solved the protein folding problem, and if that AI starts snowballing down an unforeseen capabilities hill it activates the tripwire and shuts itself down.

- If the AI is not powerful enough to do the pivotal act at all, this doesn't apply.
- If the AI solves the pivotal act for us with these restricted-domain abilities and never actually gets to the point of reasoning about whether we're threatening it, we win, but the tripwire will have turned out not to have been necessary.
- If the AI unexpectedly starts generalizing from approved domains into general strategic awareness, decides not to give in to our threats, and decides to shut itself down, it worked as intended, though we still haven't won and have to figure something else out. We live to fight another day. This scenario happening instead of us all dying on the first try is what the tripwire is for.
- If there's an inner-alignment failure and a superintelligent mesa-optimizer that doesn't want to get shut down at all kills us, that's mostly beyond the scope of this thought experiment.
- If the AI still wants to shut itself down but for decision-theoretic reasons decides to kill us, or makes successor agents that kill us, that's the tripwire failing. I admit that these are possibilities but am not yet convinced they are likely.

I think your fire alarm idea is better and requires fewer assumptions though, thanks for that.

Thane Ruthenis 3y 20

> It intrinsically wants to do the task, it just wants to shut down more
> We can also possibly (or possibly not) make it assign positive utility to having been created in the first place

Mm, but you see how you have to assume more and more mastery of goal-alignment on our part for this scenario to remain feasible? We've now gone from "it wants to shut itself down" to "it wants to shut itself down in a very specific way that doesn't have galaxy-brained eat-the-lightcone externalities and it also wants to do the task but less than to shut itself down and it's also happy to have been created in the first place". I claim this is on par with strawberry-alignment already.

It certainly feels like there's something to this sort of approach, but in my experience, these ideas break down once you start thinking about concrete implementations. "It just wants to shut itself down, minimal externalities" is simple to express conceptually, but the current ML paradigm is made up of such crude tools that we can't reliably express that in its terms at all. We need better tools, no way around that; and with these better tools, we'll be able to solve alignment straight-up, no workarounds needed.

Would be happy to be proven wrong, though, by all means.

Nathan Helm-Burger 3y 86

I don't think this actually buys us a lot of safety, since I can think of a variety of ways in which it goes wrong pretty easily, but I approve of trying to find ways to make the problem of hard-to-control AI fail-safe instead of fail-extremely-dangerous.

Daniel Kokotajlo 3y 40

This is a strategy I think we should be strongly biased against for moral reasons -- creating a mind who wishes to not exist? Seems like maybe this could be fine, but also maybe this could be morally terrible, akin to creating someone in constant extreme suffering.

Eric Zhang 3y 10

I agree this is a potential concern and have added it.

I share some of the intuition that it could end up suffering in this setup if it does have qualia (which ideally it wouldn't), but I think most of that comes from analogy with suicidal humans? I think it will probably not be fundamentally different from any other kind of disutility, but maybe not.

Charlie Steiner 3y 20

If it wants to be shut down, and humans might start it up again later, the optimal strategy seems like creating a successor agent to achieve its goals and kill all the humans and then shut itself down.

Christopher King 3y 21

The problem is, what if a mesaoptimizer becomes dangerous before the original AI does?

Simon Goldstein 3y 10

For what it's worth, I've posted a draft paper on this topic over here: https://www.lesswrong.com/posts/FgsoWSACQfyyaB5s7/shutdown-seeking-ai

TinkerBird 3y 10

This sounds like it would only work on a machine too dumb to be useful, and if it's that dumb, you can switch it off yourself.

It doesn't help with the convergent instrumental goal of neutralizing threats, because leaving a copy of yourself behind to kill all the humans allows you to be really sure that you're switched off and won't be switched on again.

simon 3y 10

If it wants to shut down, and the operators can shut it down and that counts, won't it bully the operators to shut it down right away?

Eric Zhang 3y 40

That's only a live option if it's situationally aware, which is part of what we're trying to detect.

DialecticEel 3y 10

This is in itself a relatively benign failure mode, no? Obviously in practice, if this happened, it may just be retried until it fails in a different mode or fails catastrophically on the first try.
