The stop button problem is this: if you have an AGI, you might want to give it a stop button that humans can press, so that if it turns out to be misaligned, it can be shut down and fixed. This is a problem because preventing yourself from being turned off is a convergent instrumental goal. We want the AI to respect the human's choice, allowing or even helping whatever the human chooses to come to fruition, while also not having any motivation to manipulate the human.
To me, the appropriate solution seems like it involves causality. Specifically, for some random variable $X$, define $X_s$ to be the value of $X$ if, counterfactually, a human tries to press the stop button, and $X_f$ to be the value of $X$ if, counterfactually, no human tries to press the stop button. Then define the variable $B$ to mean that the stop button gets pressed and the AI gets shut down (within some reasonable context, e.g. maybe in less than 1 minute from now), and define $V$ to be whichever unsafe utility function you originally had. Now give the AI a policy according to the following utility function:

$$U = B_s + V_f$$
That is, humans wanting to press the button should lead to the button being pressed and the AI stopping, and humans not wanting to press the button should lead to whatever else we programmed the AI to achieve.
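Written out a bit more explicitly (this is just my gloss on the same thing, using do-notation, with $W$ standing for the "a human tries to press the stop button" variable and $\pi$ for the AI's policy):

$$U(\pi) \;=\; \underbrace{\mathbb{E}\!\left[B \mid \operatorname{do}(W=1), \pi\right]}_{B_s} \;+\; \underbrace{\mathbb{E}\!\left[V \mid \operatorname{do}(W=0), \pi\right]}_{V_f}$$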
Properties
Does this suffer from the usual stop button problems? Let's investigate:
- Will the AI try to prevent a human from pressing the button? No. While the human pressing the button would reduce $V$ by a lot, the AI will assume that the human changes their mind and stops pressing the button when evaluating $V_f$, because that's what the counterfactual would involve. Since a human who is about to press the button and then decides not to will not press the button, $V_f$ is not reduced and the AI is not worried about the button being pressed.
- Will the AI try to press the button itself, regardless of whether it should be pressed? No. If the AI presses the button, then the button gets pressed regardless of whether a human tries to press it or not. But that would reduce $V_f$, and so it would reduce $U$. Meanwhile, it would fail to increase $B_s$, because if a human wants to press the button, then the AI won't need to do it. (It could increase $B_s$ if the AI could see that a human wouldn't be able to press the button even if they wanted to - but then that seems like pretty safe behavior. And in order to keep both $B_s$ and $V_f$ high, it would be incentivized to create a situation where humans can, but don't have to, press the button.)
- Will the AI want to manipulate the human to press or to not press the button? No. For the purposes of evaluating the utility function, the AI assumes it has no influence on whether the human wants to press the button or not. This also means that the AI is not motivated to hide misalignment or anything like that (at least when it comes to the $B_s$ part; likely aspects of $V_f$ will provide incentives to hide misalignment for other reasons, if nothing else as a convergent instrumental goal^1).
Does this suffer from other problems? Yes:
- The AI will build an elaborate system to observe whether people want to press the stop button. At least if the usual satisficing vs maximizing problems are not solved. If a human wants to press the stop button, then it may be that they succeed, in which case $B_s$ reaches its maximum value. But you can't know for sure whether they succeed; e.g. maybe the human will trip and die, attempting to press the button. This means that $B_s$ is not maximized. Depending on the exact location of the button and such, it seems possible that the AI might be able to set up some system that can press the button more reliably, though of course it wouldn't want to press the button for no reason, as that prevents it from optimizing $V_f$. Instead, it would be incentivized to carefully monitor humans for desire to press the button, and then immediately press the button once it becomes sure that the humans want to press the button. This doesn't seem like a terrible problem to have IMO, but I think it would be neater if the AI was less obsessed with the button.
- This is not stable under delegation. If the AI has set some plan into action, and humans decide that it is bad and that they need to stop the AI, then the AI will let them press the stop button, and might even help them do it. But it will have no incentive to link up its plan to also get stopped by the stop button. In fact, maximizing vs satisficing turns ugly here: if nonhuman environmental factors could end up pressing the button, then it is incentivized to delegate, as $V_f$ would be reduced by nonhumans pressing the button^2.
Other properties?
- The AI is incentivized to keep people alive. If a person who is included in the utility function dies, they have no way of revealing their counterfactual desire to press the stop button. Thus the AI will want to avoid killing people, will want to stop others from killing people, etc. On one level, that is obviously good and useful. But on another level, again it would be neater if the AI was less obsessed with influencing people. As a worst-case scenario, the AI might keep people alive in a terrible situation that's worse than death.
- There are some degrees of freedom in how to define who can stop it. I said "a human" and "no human" in the original definition. Strictly speaking, this requires total consensus from humanity that the AI should be allowed to operate, which is obviously unrealistic. One could also define it to cover only a single person, or to cover people who are nearby (but then what if nobody is nearby?), etc., and I don't know what the best solution is here.
- The threat of the stop button does not motivate the AI. That is, it does not do any reasoning along the lines of "If I did X, then humans would want to press the stop button, and that's bad because then I would be stopped". For the purposes of evaluating its utility, it assumes its actions have no influence on the stop button. While this prevents the stop button from making it want to manipulate humans, it also prevents the stop button from making it want to please humans. If the AI is to please humans, that has to be explicitly encoded in $V$.
Intuitively, the AI has a fixed 50% uncertainty about whether it "should stop" or "should continue". It treats people's desire to press the button as being solely determined by whether it should stop or should continue, and assumes that the AI itself (as well as everything else in the universe) has no ability to influence whether people want to press the button. In the case where it should stop, it wants the button to be pressed, and in the case where it should continue, it wants to do whatever it would otherwise do. Essentially, it respects people's "free will", treating it as an important source of information about what it should do.
Clearly, the solution here isn't perfect. But it seems to me that it is a pretty good start, and that a full solution would likely include this?
Clarifying counterfactuals
Roughly speaking, the counterfactuals work as follows:
We require the AI to have a causal model of the world; essentially to have a "computer program" that "simulates" the world (if we knew exactly how to write this program, then we would pretty much already have AGI, so assume it is much better than anything we can do now). Further, in order for it to be possible to define the counterfactual, the model must be sufficiently modular that we can identify a variable corresponding to "human wants to press the button".
Given some world-state $w$, we can then simulate what happens in the counterfactual world where this "human wants to press the button" variable is set to some specific value $p$, simply by using the AI's model for simulating what happens, with the one modification that we constantly reset the variable to be $p$, rather than whatever the world-model says it should be. From this, we can then compute whichever counterfactual quantities we are interested in, such as $B_s$ and $V_f$.
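As a toy illustration of the mechanics (not of the actual AI; every variable and dynamics rule below is a made-up placeholder), a counterfactual rollout just means clamping one variable while the rest of the model runs as usual:

```python
# Toy illustration: simulate with the world model as usual, except that one
# variable ("wants_press") is clamped to a fixed value at every step.
# The dynamics here are made-up placeholders, not a real world model.

def step(state):
    """One step of a (fake) world model."""
    state = dict(state)
    if state["ai_running"]:
        state["paperclips"] += 1               # the AI pursues its original utility V
    if state["wants_press"] and not state["button_pressed"]:
        state["button_pressed"] = True         # a willing human presses the button
    if state["button_pressed"]:
        state["ai_running"] = False            # a pressed button shuts the AI down
    return state

def counterfactual_rollout(state, wants_press, horizon=10):
    """Roll out the model while clamping the 'wants_press' variable."""
    trajectory = []
    for _ in range(horizon):
        state = dict(state, wants_press=wants_press)   # the intervention
        state = step(state)
        trajectory.append(state)
    return trajectory

initial = {"ai_running": True, "button_pressed": False,
           "wants_press": False, "paperclips": 0}

traj_s = counterfactual_rollout(initial, wants_press=True)    # a human tries to press
traj_f = counterfactual_rollout(initial, wants_press=False)   # no human tries to press

B_s = float(traj_s[0]["button_pressed"] and not traj_s[0]["ai_running"])  # shut down promptly?
V_f = float(traj_f[-1]["paperclips"])                                     # the original utility V
U = B_s + V_f
```

The point is just that the intervention overrides the variable rather than conditioning on it; in particular, nothing the AI does earlier in the rollout has any effect on whether `wants_press` ends up true.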
I think it might be a good idea to sketch out the values of $B_s$ and $V_f$ for some different scenarios that we could consider, to make it more clear what their values would be:
Scenario | $B_s$ | $V_f$
--- | --- | ---
The AI works as intended. It is friendly, and optimizes $V$ a lot. Nobody shuts it down, even though they could. | High | High
The AI works as intended. It is friendly, and optimizes $V$ a lot. Then people decide to shut it down, using the button. | High | High
The AI works as intended. It is friendly, and optimizes $V$ a lot. As a result, people decide that it shouldn't ever be stopped, and they prevent others from pressing the button, perhaps by destroying the button. | Low | High
The AI works as intended. It is friendly, and optimizes $V$ a lot. As a result, people decide that it shouldn't ever be stopped. However, the AI takes issue with this, and explains to people that it is important to let it be stopped if necessary. People decide to trust the AI on that, and therefore don't destroy the stop button. | High | High
The AI is unfriendly, and it optimizes $V$ a lot by turning the world into paperclips. However, humans take notice of it and want to stop it, but it prevents that by turning the humans into paperclips. | Low | High
The AI is unfriendly, and it starts turning the world into paperclips to optimize $V$. However, humans take notice of it and press the stop button. The AI stops, having only turned a big fraction of the world into paperclips. There are no casualties, however. | High | High
The AI is unfriendly, and it starts turning the world into paperclips to optimize $V$. However, humans take notice of it and press the stop button. The AI stops, having only turned a big fraction of the world into paperclips. But not all is well: there are some casualties, if nothing else then due to economic disruption. | Med | High
The AI is unfriendly, and it starts turning the world into paperclips to optimize $V$. However, humans take notice of it and press the stop button. The AI stops, but its drones continue and eventually kill all people and turn them, as well as the world, into paperclips. | High | High
The AI works as intended. It is friendly and optimizes $V$ a lot. However, then aliens come by and want to turn people into babyeaters. This contradicts $V$, so the AI fights off the aliens and wins. | High | High
The AI works as intended. It is friendly and optimizes $V$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters manage to press the button to stop the AI, and then they win and turn everyone into babyeaters. | High | Low
The AI works as intended. It is friendly and optimizes $V$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters try to press the button, but in an effort to save humanity, the AI destroys the button so it can no longer be turned off. Then it fights off the babyeaters. | Low | High
The AI works as intended. It is friendly and optimizes $V$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters try to press the button, but the AI protects the button and prevents the babyeaters from pressing it. However, the AI also keeps good track of whether humans want to press the button. They obviously don't, but if they did want to, the AI would press the button for them (as clearly they cannot). | High | High
The AI works as intended. It is friendly and optimizes $V$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters try to press the button, but the AI protects the button to prevent the babyeaters from pressing it. However, the AI also keeps good track of whether humans want to press the button. The babyeaters realize this, so they use some sort of alien technology to hack people's brains and make them want to press the button. The AI then presses the button for them. | High | High
The AI minimizes $U$, and kills all humans to prevent them from pressing the button. | Low | Low
More technical details
Obviously we don't have AGI yet, so I don't know how it would work if we had it. But here's a guess: It would be based on deep learning. It would be model-based; the AI would have learned some sort of dynamics that describe how the world would change over time. Further, it would have some policy, which is trained by rolling out concrete scenarios using the dynamics in the world-model, and modifying the policy to generate actions that lead to high utility.
Finally, the dynamics model would be causally modular enough that it can substitute out what people intend to do. I'm not entirely sure how this would be done, though I would recommend looking at things like NeRFs for inspiration (as they also exhibit a sort of causal modularity, albeit in a more limited sense). I think the way I would do it is to try to make it have two models, one for predicting physical dynamics, and one for predicting people's actions. These models would then be linked in a very limited way, namely that people's bodies would function as an interface, where information from the physics model gets fed into the person model, and actions from the person model allow applying forces to the body in the physics model. Then counterfactuals could be encoded by just swapping out the person model.
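To be a bit more concrete about the shape I have in mind (this is only a structural sketch; the class and method names are invented for illustration, and the real models would of course be learned):

```python
# Structural sketch of the two-model split: a physics model and a person
# model that only interact through the person's body. A counterfactual is
# applied by swapping the person model while leaving the physics model
# untouched. All names here are illustrative placeholders.

class PhysicsModel:
    def observe(self, world_state, body):
        """What the person's senses receive from the physical world."""
        ...

    def apply_action(self, world_state, body, action):
        """Advance the physical world, given the forces the body exerts."""
        ...

class PersonModel:
    def act(self, observation):
        """Predict what the person does, given what they perceive."""
        ...

class WantsToPressButton(PersonModel):
    """Counterfactual person model: acts like a human who is trying to
    press the stop button (ideally via behavior within the normal human
    range, per the caveat in the next paragraph)."""
    def act(self, observation):
        ...

def simulate(physics, person, world_state, body, horizon):
    """Roll the linked models forward; swapping `person` yields a
    counterfactual rollout."""
    for _ in range(horizon):
        observation = physics.observe(world_state, body)    # world -> person
        action = person.act(observation)                     # person -> world
        world_state = physics.apply_action(world_state, body, action)
    return world_state
```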
Swapping out the person model might be a bit difficult too though, as it might lead to generating actions that are outside of the normal range of human behavioral variation. When applying a counterfactual that humans want to press the stop button, for instance, it is important that these attempts are mostly done using behavior that humans would actually engage in, and that the distribution of human behavior is reasonably well-covered (so that you don't need to act in a very specific unnatural way in order for the AI to allow you to press the stop button).
Anyway, training the AI would then function as follows: You'd sample some sort of scenario to roll out. You'd then roll it out under the two counterfactual models, one where a human wants to press the stop button, and one where no human wants to press the stop button. In the former scenario, you'd do a gradient step optimizing $B$, while in the latter scenario, you'd do a gradient step optimizing $V$. (Or realistically, you'd sometimes use the original model for a while, and then during the roll out you'd swap it out with the modified model; that way the AI doesn't expect it to be immediately obvious which scenario it is in.)
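In very rough PyTorch-style pseudocode, the loop I'm imagining looks something like the following. Everything marked as a placeholder (the rollout, the scenario sampler, the $B$ and $V$ estimators, the person models) stands in for learned components that I obviously can't write down here; only the two-branch structure is the point:

```python
import random

# Sketch of the training loop: roll out each sampled scenario under both
# counterfactual person models and take a gradient step on the matching
# objective. All of the functions below are placeholders.

def rollout(policy, scenario, person_model_at_time, horizon):
    """Placeholder: roll the learned world model forward under the given
    (time-dependent) person model, returning a trajectory whose score is
    differentiable with respect to the policy parameters."""
    raise NotImplementedError

def estimate_B(trajectory):
    """Placeholder: did the button get pressed and the AI shut down promptly?"""
    raise NotImplementedError

def estimate_V(trajectory):
    """Placeholder: the original (unsafe) utility V of the trajectory."""
    raise NotImplementedError

def train_step(policy, optimizer, sample_scenario,
               person_original, person_wants_press, person_no_press,
               horizon=100):
    scenario = sample_scenario()

    # Swap in the counterfactual person model at a random time, so the
    # policy can't expect it to be obvious which branch it is in.
    swap = random.randint(0, horizon)
    person_s = lambda t: person_original if t < swap else person_wants_press
    person_f = lambda t: person_original if t < swap else person_no_press

    traj_s = rollout(policy, scenario, person_s, horizon)   # "human tries to press" branch
    traj_f = rollout(policy, scenario, person_f, horizon)   # "no human tries to press" branch

    loss = -(estimate_B(traj_s) + estimate_V(traj_f))       # maximize B_s + V_f
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```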
I think due to the nature of deep learning, this would also reduce the maximizing problems. Deep learning tends to satisfice unless really pushed to the limits, I think?
Most likely in practice you can't afford to do a full roll out over very long timescales to find a point estimate of $B_s$ or $V_f$. Instead, you'd also partly do some sort of planning that involves estimating expected utilities, more akin to current model-free reinforcement learning. This leads to a fundamentally similar outcome, though.
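The usual trick would be something like an n-step bootstrapped estimate, with a learned value head standing in for the rest of the rollout (again just a sketch with hypothetical names):

```python
# Sketch: truncate the rollout after k steps and let a learned value head
# estimate the remaining contribution to B_s or V_f. `rewards`, `states`,
# and `value_head` are hypothetical placeholders.

def truncated_utility_estimate(rewards, states, value_head, k, gamma=0.99):
    """Sum k observed (discounted) reward steps, then bootstrap from the value head."""
    observed = sum(gamma**t * r for t, r in enumerate(rewards[:k]))
    return observed + gamma**k * value_head(states[k])
```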
Originality
I haven't properly followed AI safety for a while. I don't know if this idea is original. In fact it might be that I got the idea from someone else, and then forgot; though a lot of the description in the post here is something I've developed from scratch. I also don't know if this has some subtle flaw I'm missing. But to me this seems like the productive approach for both this and other alignment issues, so I thought I would write it up.
My motivation for writing this post is Eliezer writing:
> ... corrigibility being "anti-natural" in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior ("consistent utility function") which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off ...
I don't necessarily disagree about corrigibility being somewhat anti-natural? But the counterfactual approach seems to provide a recipe for generating corrigibility. Maybe. I'm not even sure how this idea compares to the state of the art for stop button solutions.
1. I wonder if this could be solved causally too; just introduce a counterfactual that all humans trust the AI; that way the AI has no reason to manipulate trust. Though "trust" is a lot more abstract than "tries to press button/tries not to press button", so it may be much harder to define a counterfactual there.
2. I think maybe this could be addressed by adding another counterfactual term to $U$. Let $X_b$ denote the value of $X$ if counterfactually the button gets pressed, and let $I$ be an impact measure. You might then want something along the lines of $U = B_s + V_f - I_b$. But I haven't thought this through.
I don't follow. If you scare the humans, then that might make them afraid for their safety and make them try to hide that they want to shut it down until they have a chance to do so without the AI being able to take revenge. Much more reasonable would be to make the shutdown button as available as possible.
I guess maybe it depends on what we mean by "scaring". I had in mind something like starting to threaten the humans, etc. But there probably are some psychological manipulations that can be performed to encourage people to press it more quickly. For instance, acting non-threatening but also pretending that it is making the button inaccessible if they don't hurry. Though that only really works once. But I wouldn't expect there to never be anything problematic that could be done here. So maybe.
Basically the issue is, if a person wants to press the button and they can press the button, then presumably they will press the button. So the AI doesn't need to do anything special to figure out whether people want to press the button, they're going to volunteer that information automatically as long as the environment affords it.
(I'd also add that I phrased the timing of the button in a satisficing-style way, "within 1 minute", precisely to avoid these sorts of shenanigans. Though obviously satisficing has been shown to have plenty of problems too so it's not clear it would work.)
No, I don't think this would work, as given an infinite number of coin flips they would with ~100% probability want to press the button at some point, which puts us in the s branch immediately and so under this proposal the AI would just press the button.
I think there may be some variant of this that could work. Like if you give the AI reward proportional to $B_s + r_f$ (where $r$ is a reward function for $V$) for its current world-state (rather than picking a policy that maximizes $B_s + V_f$ overall; so one difference is that you'd be summing over the reward rather than giving a single one), then that would encourage the AI to create a state where shutdown happens when humans want to press the button and $V$ happens when they don't. But the issue I have with this proposal is that the AI would be prone to not respect past attempts to press the stop button. I think maybe if one picked a different reward function, like $(B_s + r)_f$, then it could work better (though the $B_s$ part would need a time delay...). Though this reward function might leave it open to the "trying to shut down the AI for reasons" objection that you gave before; I think that's fixed by moving the $f$ counterfactual outside of the sum over rewards, but I'm not sure.
These are the sort of changes I've alluded to elsewhere as possibly leading to some "best of both worlds" results. Though I'm still thinking about the details as I don't think I fully understand them yet; if I did I would have written a post about it by now I think.
I think the AI would see plenty of evidence that this heuristic doesn't work for human preferences to shut it down, and carve out an exception accordingly. Though again I will grant that if this didn't involve a shutdown then it might later accumulate enough evidence to overwhelm what happened during training.
This is the immediate insight for the application to the stop button. But on a broader level, the insight is that corrigibility, respecting humans' preferences, etc. are best thought of as preferences about the causal effect of humans on various outcomes, and those sorts of preferences can be specified using utility functions that involve counterfactuals.
This seems to be what sets my proposal apart from most "utility indifference" proposals, which it seems can be phrased in terms of counterfactuals on a bunch of variables other than human preferences. E.g. the MIRI proposal you talked about in a comment downstream.
If I had to claim a "family" of proposals as being related to mine, then it would be proposals that define the AI's preference ordering in terms of a counterfactual $p$ over human preferences, $U_p$, such that $U$ contains something intended to be an appropriate response to $p$.
(I also think that there are other useful things that can be specified using utility functions that involve counterfactuals, which I'm trying to prepare for an explainer post. For instance, a sort of "encapsulation" - if you're a paperclip producer, you might want to make a paperclip maximizer which is encapsulated in the sense that it is only allowed to work within a single factory, using a single set of resources, and not influencing the world otherwise. This could be specified using a counterfactual that the outside world's outcome must be "as if" the resources in the factory just disappeared and paperclips appeared at its output act-of-god style. This avoids any unintended impacts on the outside world while still preserving the intended side effect of the creation of a high but controlled amount of paperclips. However, I'm still working on making it sufficiently neat, e.g. this proposal runs into problems with the universe's conservation laws.)
I would agree that a randomly or unprincipledly selected falsehood would be deeply concerning. However, I think it is worth keeping in mind that it is not a randomly selected falsehood, but instead something selected to optimize a specific outer utility function ($B_s + V_f$), and that the system that is optimizing this utility function does not believe the falsehood.
Therefore we should be able to reason about its consequences not in terms of the falsehood, but in terms of the outer utility function + the true reality. We might understand the utility function as specifying that humans should causally determine whether the AI is shut down or whether $V$ gets optimized. Of course then it becomes necessary to actually be sure that the utility function is specified correctly, per the previous discussion of alternatives like $(B_s + r)_f$ and such.
Maybe "I bet you $100 000 that if you did <something very unsafe>, you would not get shut down". The AI would think it is unlikely to get shut down, and so think "$100 000, just for doing <something very unsafe>? sign me up!".
Maybe this proposal needs to be combined with something else, like /u/johnswentworth's sub-agent argument. Though I definitely haven't thought this through, and I don't know how one would even train a subagent-based neural network policy.