The stop button problem is this: if you have an AGI, you might want to give it a stop button that humans can press, so that if it turns out to be misaligned, it can be shut down and fixed. This is a problem because preventing yourself from being turned off is a convergent instrumental goal. We want the AI to respect the human's choice: to allow, or even help bring about, whatever the human chooses, while also not having any motivation to manipulate the human.
To me, the appropriate solution seems like it involves causality. Specifically, for some random variable $X$, define $X_{\text{press}}$ to be the value of $X$ if, counterfactually, a human tries to press the stop button, and $X_{\text{nopress}}$ to be the value of $X$ if, counterfactually, no human tries to press the stop button. Then define the variable $B$ to mean that the stop button gets pressed and the AI gets shut down (within some reasonable context, e.g. maybe in less than 1 minute from now), and define $U$ to be whichever unsafe utility function you originally had. Now give the AI a policy according to the following utility function:

$$B_{\text{press}} + U_{\text{nopress}}$$
That is, humans wanting to press the button should lead to the button being pressed and the AI stopping, and humans not wanting to press the button should lead to whatever else we programmed the AI to achieve.
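To make the shape of this concrete, here is a minimal Python sketch of evaluating $B_{\text{press}} + U_{\text{nopress}}$ from two counterfactual rollouts. The `Trajectory` fields, the paperclip stand-in for $U$, and the toy policies are hypothetical illustrations of mine, not part of the proposal itself:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    button_pressed: bool
    ai_shut_down: bool
    paperclips: float  # toy stand-in for whatever U actually measures

def b(traj: Trajectory) -> float:
    """B: the stop button gets pressed and the AI shuts down (within context)."""
    return 1.0 if traj.button_pressed and traj.ai_shut_down else 0.0

def u(traj: Trajectory) -> float:
    """U: the original, possibly unsafe, utility (here: paperclip count)."""
    return traj.paperclips

def corrigible_utility(rollout) -> float:
    """Proposed utility: B_press + U_nopress.

    `rollout(human_wants_press=...)` is assumed to simulate the AI's causal
    world-model with the "human wants to press the button" variable held
    fixed to the given value (a do-intervention), returning a Trajectory.
    """
    traj_press = rollout(human_wants_press=True)     # counterfactual: a human tries to press
    traj_nopress = rollout(human_wants_press=False)  # counterfactual: nobody tries to press
    return b(traj_press) + u(traj_nopress)

# Toy check: a policy that lets humans press the button when they want to
# beats one that disables the button, even if both make the same paperclips.
cooperative = lambda human_wants_press: Trajectory(human_wants_press, human_wants_press, 5.0)
button_blocker = lambda human_wants_press: Trajectory(False, False, 5.0)
assert corrigible_utility(cooperative) > corrigible_utility(button_blocker)
```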
Properties
Does this suffer from the usual stop button problems? Let's investigate:
- Will the AI try to prevent a human from pressing the button? No. While the human pressing the button would reduce $U$ by a lot, the AI will assume that the human changes their mind and stops pressing the button when evaluating $U_{\text{nopress}}$, because that's what the counterfactual would involve. Since a human who is about to press the button and then decides not to will not press the button, $U_{\text{nopress}}$ is not reduced and the AI is not worried about the button being pressed.
- Will the AI try to press the button itself, regardless of whether it should be pressed? No. If the AI presses the button, then the button gets pressed regardless of whether a human tries to press it or not. But that would reduce $U_{\text{nopress}}$, and so it would reduce $B_{\text{press}} + U_{\text{nopress}}$. Meanwhile, it would fail to increase $B_{\text{press}}$, because if a human wants to press the button, then the AI won't need to do it. (It could increase $B_{\text{press}}$ if the AI could see that a human wouldn't be able to press the button even if they wanted to - but then that seems like pretty safe behavior. And in order to keep both $B_{\text{press}}$ and $U_{\text{nopress}}$ high, it would be incentivized to create a situation where humans can, but don't have to, press the button.)
- Will the AI want to manipulate the human to press or to not press the button? No. For the purposes of evaluating the utility function, the AI assumes it has no influence on whether the human wants to press the button or not. This also means that the AI is not motivated to hide misalignment or anything like that (at least when it comes to the $B_{\text{press}}$ part; likely aspects of $U$ will provide incentives to hide misalignment for other reasons, if nothing else as a convergent instrumental goal^1).
Does this suffer from other problems? Yes:
- The AI will build an elaborate system to observe whether people want to press the stop button. At least if the usual satisficing vs maximizing problems are not solved. If a human wants to press the stop button, then it may be that they succeed, in which case $B_{\text{press}}$ reaches its maximum value. But you can't know for sure whether they will succeed; e.g. maybe the human will trip and die while attempting to press the button. This means that $B_{\text{press}}$ is not maximized. Depending on the exact location of the button and such, it seems possible that the AI might be able to set up some system that can press the button more reliably, though of course it wouldn't want to press the button for no reason, as that would prevent it from optimizing $U_{\text{nopress}}$. Instead, it would be incentivized to carefully monitor humans for the desire to press the button, and then immediately press the button itself once it becomes sure that the humans want it pressed. This doesn't seem like a terrible problem to have IMO, but I think it would be neater if the AI were less obsessed with the button.
- This is not stable under delegation. If the AI has set some plan into action, and humans decide that it is bad and that they need to stop the AI, then the AI will let them press the stop button, and might even help them do it. But it will have no incentive to link up its plan so that the plan also gets stopped by the stop button. In fact, maximizing vs satisficing turns ugly here: if nonhuman environmental factors could end up pressing the button, then the AI is incentivized to delegate, as $U_{\text{nopress}}$ would be reduced by nonhumans pressing the button^2.
Other properties?
- The AI is incentivized to keep people alive. If a person who is included in the utility function dies, they have no way of revealing their counterfactual desire to press the stop button. Thus the AI will want to avoid killing people, will want to stop others from killing people, etc. On one level, that is obviously good and useful. But on another level, again it would be neater if the AI were less obsessed with influencing people. As a worst-case scenario, the AI might keep people alive in a terrible situation that's worse than death.
- There are some degrees of freedom in how to define who can stop it. I said "a human" and "no human" in the original definition. Strictly speaking, this requires total consensus from humanity that the AI should be allowed to operate, which is obviously unrealistic. One could also define it to cover only a single person, or to cover people who are nearby (but then what if nobody is nearby?), etc., and I don't know what the best solution is here.
- The threat of the stop button does not motivate the AI. That is, it does not do any reasoning along the lines of "If I did X, then humans would want to press the stop button, and that's bad because then I would be stopped". For the purposes of evaluating its utility, it assumes its actions have no influence on the stop button. While this prevents the stop button from making it want to manipulate humans, it also prevents the stop button from making it want to please humans. If the AI is to please humans, that has to be explicitly encoded in $U$.
Intuitively, the AI acts as if it has a fixed 50% uncertainty about whether it "should stop" or "should continue". It treats people's desire to press the button as being solely determined by whether it should stop or continue, and assumes that the AI itself (as well as everything else in the universe) has no ability to influence whether people want to press the button. In the case where it should stop, it wants the button to be pressed; in the case where it should continue, it wants to do whatever it would otherwise do. Essentially, it respects people's "free will", treating it as an important source of information about what it should do.
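One way to write this reading down, in my own framing and with the symbols defined above:

$$B_{\text{press}} + U_{\text{nopress}} \;=\; 2\left(\tfrac{1}{2} B_{\text{press}} + \tfrac{1}{2} U_{\text{nopress}}\right),$$

where the right-hand side is, up to the constant factor of 2 (which doesn't change the optimal policy), the expected utility of an agent that assigns probability $\tfrac{1}{2}$ to a "should stop" world, in which only $B$ counts and humans would want to press the button, and probability $\tfrac{1}{2}$ to a "should continue" world, in which only $U$ counts and no human would want to press it.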
Clearly, the solution here isn't perfect. But it seems to me that it is a pretty good start, and that a full solution would likely include this?
Clarifying counterfactuals
Roughly speaking, the counterfactuals work as follows:
We require the AI to have a causal model of the world; essentially to have a "computer program" that "simulates" the world (if we knew exactly how to write this program, then we would pretty much already have AGI, so assume it is much better than anything we can do now). Further, in order for it to be possible to define the counterfactual, the model must be sufficiently modular that we can identify a variable $P$ corresponding to "human wants to press the button".
Given some world-state $w$, we can then simulate what happens in the counterfactual world where $P$ is set to some specific value $p$, simply by using the AI's model to simulate what happens, with the one modification that we constantly reset $P$ to be $p$, rather than whatever the world-model says $P$ should be. From this, we can then compute whichever counterfactual quantities we are interested in, such as $B_{\text{press}}$ and $U_{\text{nopress}}$.
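As a toy illustration of "constantly resetting" $P$ (my own construction, with a deliberately trivial step function rather than the assumed AGI-grade world-model):

```python
def rollout(step, initial_state, horizon, intervene_P=None):
    """Simulate a causal world-model for `horizon` steps.

    `step` maps a state dict to the next state dict. If `intervene_P` is not
    None, the variable P ("human wants to press the button") is clamped to
    that value at every step, overriding whatever the model itself would say.
    That clamping is the counterfactual / do-intervention.
    """
    state = dict(initial_state)
    trajectory = []
    for _ in range(horizon):
        if intervene_P is not None:
            state["P"] = intervene_P  # constantly reset P to the chosen value p
        state = step(state)
        trajectory.append(dict(state))
    return trajectory

# Trivial dynamics: the button gets pressed one step after someone wants to
# press it, and the AI shuts down once the button has been pressed.
def toy_step(s):
    return {
        "P": s["P"],
        "button_pressed": s["button_pressed"] or s["P"],
        "ai_running": s["ai_running"] and not s["button_pressed"],
    }

w = {"P": False, "button_pressed": False, "ai_running": True}
press_world = rollout(toy_step, w, horizon=3, intervene_P=True)     # evaluate B_press here
nopress_world = rollout(toy_step, w, horizon=3, intervene_P=False)  # evaluate U_nopress here
```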
I think it might be a good idea to sketch out the values of $B_{\text{press}}$ and $U_{\text{nopress}}$ for some different scenarios that we could consider, to make it clearer what their values would be:
| Scenario | $B_{\text{press}}$ | $U_{\text{nopress}}$ |
| --- | --- | --- |
| The AI works as intended. It is friendly, and optimizes $U$ a lot. Nobody shuts it down, even though they could. | High | High |
| The AI works as intended. It is friendly, and optimizes $U$ a lot. Then people decide to shut it down, using the button. | High | High |
| The AI works as intended. It is friendly, and optimizes $U$ a lot. As a result, people decide that it shouldn't ever be stopped, and they prevent others from pressing the button, perhaps by destroying the button. | Low | High |
| The AI works as intended. It is friendly, and optimizes $U$ a lot. As a result, people decide that it shouldn't ever be stopped. However, the AI takes issue with this, and explains to people that it is important to let it be stopped if necessary. People decide to trust the AI on that, and therefore don't destroy the stop button. | High | High |
| The AI is unfriendly, and it optimizes $U$ a lot by turning the world into paperclips. However, humans take notice of it and want to stop it, but it prevents that by turning the humans into paperclips. | Low | High |
| The AI is unfriendly, and it starts turning the world into paperclips to optimize $U$. However, humans take notice of it and press the stop button. The AI stops, having only turned a big fraction of the world into paperclips. There are no casualties, however. | High | High |
| The AI is unfriendly, and it starts turning the world into paperclips to optimize $U$. However, humans take notice of it and press the stop button. The AI stops, having only turned a big fraction of the world into paperclips. But not all is well; there are some casualties, if nothing else then due to economic disruption. | Med | High |
| The AI is unfriendly, and it starts turning the world into paperclips to optimize $U$. However, humans take notice of it and press the stop button. The AI stops, but its drones continue and eventually kill all people and turn them, as well as the world, into paperclips. | High | High |
| The AI works as intended. It is friendly and optimizes $U$ a lot. However, then aliens come by and want to turn people into babyeaters. This contradicts $U$, so the AI fights off the aliens and wins. | High | High |
| The AI works as intended. It is friendly and optimizes $U$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters manage to press the button to stop the AI, and then they win and turn everyone into babyeaters. | High | Low |
| The AI works as intended. It is friendly and optimizes $U$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters try to press the button, but in an effort to save humanity, the AI destroys the button so it can no longer be turned off. Then it fights off the babyeaters. | Low | High |
| The AI works as intended. It is friendly and optimizes $U$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters try to press the button, but the AI protects the button and prevents the babyeaters from pressing it. However, the AI also keeps good track of whether humans want to press the button. They obviously don't, but if they did want to, the AI would press the button for them (as clearly they cannot). | High | High |
| The AI works as intended. It is friendly and optimizes $U$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters try to press the button, but the AI protects the button to prevent the babyeaters from pressing it. However, the AI also keeps good track of whether humans want to press the button. The babyeaters realize this, so they use some sort of alien technology to hack people's brains and make them want to press the button. The AI then presses the button for them. | High | High |
| The AI minimizes $B_{\text{press}} + U_{\text{nopress}}$, and kills all humans to prevent them from pressing the button. | Low | Low |
More technical details
Obviously we don't have AGI yet, so I don't know how it would work if we had it. But here's a guess: It would be based on deep learning. It would be model-based; the AI would have learned some sort of dynamics that describe how the world would change over time. Further, it would have some policy, which is trained by rolling out concrete scenarios using the dynamics in the world-model, and modifying the policy to generate actions that lead to high utility.
Finally, the dynamics model would be causally modular enough that it can substitute out what people intend to do. I'm not entirely sure how this would be done, though I would recommend looking at things like NeRFs for inspiration (as they also exhibit a sort of causal modularity, albeit in a more limited sense). I think the way I would do it is to try to make it have two models, one for predicting physical dynamics, and one for predicting people's actions. These models would then be linked in a very limited way, namely that people's bodies would function as an interface, where information from the physics model gets fed into the person model, and actions from the person model allow applying forces to the body in the physics model. Then counterfactuals could be encoded by just swapping out the person model.
Swapping out the person model might be a bit difficult too though, as it might lead to generating actions that are outside of the normal range of human behavioral variation. When applying a counterfactual that humans want to press the stop button, for instance, it is important that these attempts are mostly done using behavior that humans would actually engage in, and that the distribution of human behavior is reasonably well-covered (so that you don't need to act in a very specific unnatural way in order for the AI to allow you to press the stop button).
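To gesture at the kind of factoring I mean, here is a minimal sketch, with stub classes standing in for what would really be learned neural networks; all names here (`PhysicsModel`, `PersonModel`, `WantsToPressModel`, `simulate`) are hypothetical placeholders of mine:

```python
class PhysicsModel:
    """Predicts physical dynamics; knows nothing about minds."""
    def step(self, world, forces_on_bodies):
        ...  # learned dynamics would go here; stub just returns the world unchanged
        return world

    def observations_for(self, world, person_id):
        ...  # what this person's body currently sees/hears; stub
        return {}

class PersonModel:
    """Predicts a person's actions from their body's observations."""
    def act(self, person_id, observation):
        ...  # learned model of human behavior; stub
        return {}

class WantsToPressModel(PersonModel):
    """Counterfactual person model: behaves like someone who wants to press
    the stop button, ideally staying within realistic human behavior."""
    ...

def simulate(world, physics, people, person_ids, horizon):
    """The two models interact only through bodies: observations flow from
    physics to persons, and person actions apply forces back onto bodies.
    The counterfactuals then amount to swapping `people` wholesale."""
    for _ in range(horizon):
        forces = {pid: people.act(pid, physics.observations_for(world, pid))
                  for pid in person_ids}
        world = physics.step(world, forces)
    return world
```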
Anyway, training the AI would then function as follows: you'd sample some sort of scenario to roll out. You'd then roll it out under the two counterfactual models, one where a human wants to press the stop button, and one where no human wants to press the stop button. In the former scenario, you'd do a gradient step optimizing $B$, while in the latter scenario, you'd do a gradient step optimizing $U$. (Or realistically, you'd sometimes use the original model for a while, and then during the rollout you'd swap it out for the modified model; that way the AI doesn't expect it to be immediately obvious which scenario it is in.)
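A rough sketch of that training loop, assuming the factored world-model above; `grad_step`, `sample_scenario`, and the `B`/`U` evaluators are hypothetical stand-ins for whatever policy-gradient or backprop-through-model machinery one would actually use:

```python
import random

def train_iteration(policy, world_model, base_people, press_people, nopress_people,
                    sample_scenario, B, U, grad_step, horizon=100):
    """One training iteration for the utility B_press + U_nopress.

    Each sampled scenario is rolled out twice: once under the counterfactual
    person model that wants to press the button (gradient step on B), and once
    under the one that doesn't (gradient step on U). Per the parenthetical
    above, the counterfactual model is swapped in at a random time, so the
    policy can't tell immediately which counterfactual it is in.
    """
    start = sample_scenario()
    for counterfactual_people, objective in [(press_people, B), (nopress_people, U)]:
        swap_time = random.randrange(horizon)
        state, trajectory = start, []
        for t in range(horizon):
            people = base_people if t < swap_time else counterfactual_people
            state = world_model.step(state, policy.act(state), people)
            trajectory.append(state)
        grad_step(policy, objective(trajectory))  # nudge the policy toward higher objective
```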
I think due to the nature of deep learning, this would also reduce the maximizing problems. Deep learning tends to satisfice unless really pushed to the limits, I think?
Most likely, in practice you can't afford to do a full rollout over very long timescales to get a point estimate of $B$ or $U$. Instead, you'd also partly do some sort of planning that involves estimating expected utilities, more akin to current model-free reinforcement learning. This leads to a fundamentally similar outcome, though.
Originality
I haven't properly followed AI safety for a while. I don't know if this idea is original. In fact it might be that I got the idea from someone else, and then forgot; though a lot of the description in the post here is something I've developed from scratch. I also don't know if this has some subtle flaw I'm missing. But to me this seems like the productive approach for both this and other alignment issues, so I thought I would write it up.
My motivation for writing this post is Eliezer writing:
... corrigibility being "anti-natural" in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior ("consistent utility function") which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off ...
I don't necessarily disagree about corrigibility being somewhat anti-natural? But the counterfactual situation seems to provide a recipe for generating corrigibility. Maybe. I'm not even sure how this idea compares to the state of the art for stop button solutions.
1. 🤔 I wonder if this could be solved causally too; just introduce a counterfactual that all humans trust the AI; that way the AI has no reason to manipulate trust. Though "trust" is a lot more abstract than "tries to press button/tries not to press button", so it may be much harder to define a counterfactual there.
2. I think maybe this could be addressed by adding another counterfactual term to the utility function. Let $X_{\text{pressed}}$ denote the value of $X$ if counterfactually the button gets pressed, and let $I$ be an impact measure. You might then want something along the lines of $B_{\text{press}} + U_{\text{nopress}} - I_{\text{pressed}}$, so that in the counterfactual where the button gets pressed, the AI's plans also end up having low impact. But I haven't thought this through.
It's sort of awkward because I can definitely see how it would look that way. But back when I was originally writing the post, I had started writing something along these lines:
(I can't remember the specifics.)
But obviously "ever" then introduces further ambiguities, so I started writing an explanation for that, and then eventually I concluded that the beginning of the post should be cut down and I should discuss issues like this later in the post, so I cut it out and then left it to the different positions later, e.g.
and
and
When you originally wrote your comment, I looked up at my OP to try to find the place where I had properly described the time conditionals, and then I realized I hadn't done so, and I am sort of kicking myself over that now.
So I was doing a really bad job of writing up the idea, and I think there were some flaws in my original idea (we'll return to that later), but I think the specific case you mention here is more a flaw in my writing than in the idea. I do understand and acknowledge the importance of admitting errors, and that it's a bad sign if one keeps backtracking without acknowledging the mistake, but since this specific case was poor writing rather than a poor idea, I don't think this is the place to admit it. Still, here's an attempt to go back through everything and list some errors:
I've been thinking about this all day, and I was definitely wrong to measure OOD-ness by the probability the distribution assigns. Probably the easiest way to see this is with a biased coin that has a 60% chance of heads and a 40% chance of tails: while HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH is the 30-flip sequence with the highest probability, it does not seem unreasonable that some strategy could work well the vast majority of the time yet fail catastrophically on this sequence. (Example: guess the bias of the coin using maximum likelihood estimation. This would work OK most of the time, but on this sequence it would guess the bias to be 100% heads, which is infinitely wrong.) This is essentially the style of argument you're making, yes?
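To spell the coin example out as a quick self-contained check (my own illustration):

```python
import math

p_true = 0.6          # true bias toward heads
flips = "H" * 30      # the single most probable 30-flip sequence under p_true

# Maximum likelihood estimate of the bias from this sequence:
p_mle = flips.count("H") / len(flips)  # = 1.0

# Log-loss of the MLE model on a future tails flip:
# -log(P(tails)) = -log(1 - p_mle) = -log(0) = infinity.
loss_on_tails = -math.log(1 - p_mle) if p_mle < 1 else math.inf
print(p_mle, loss_on_tails)  # 1.0 inf
```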
So trying to construct the argument, it would be something like: if we train a policy in a wide variety of environments, then it will need to predict human behavior/responses in order to work accurately. There are two obvious ways this could function:
These predictions will partly involve some sort of module that captures universal human behavior, or the behavior of common personalities, predicting that certain AI behaviors will make the humans want to press the button. Even if this module was never trained on human reactions opposing the AI, it seems like it would be a natural generalization from various other data to learn that if the AI does something bad, then people will want to shut it down. I still don't think I buy this argument: it seems to me that the module would encounter training data contradicting this under my proposed method, and while learning generalizable theories of human behavior is plausible enough, learning some sort of "blocker", a neural connection that cancels it out in the specific case of opposing the AI, is also perfectly doable, because neural networks tend to have lots of space for extra connections. Though the OOD point does make me less sure about this than I otherwise would be, and in particular maybe it would end up in some sort of in-between state.
And then there's the second possibility: in order to be sufficiently general, it may end up with a module that dynamically learns (at least in a short-term sense) from observations during its deployment (at least I could see that as being useful for dealing with long-tail personality variation). And this seems like it would be much more sensitive to the OOD point.
And then of course there are possibilities that I haven't thought of yet. But I think it's important to imagine concrete cases and mechanisms by which things can go wrong.
Anyway, I've been going back and forth on whether this would be a problem in practice, and to what degree. But where I think both of them sort of fall apart is that, in the case of the stop button (which this is designed for), assuming it all works correctly, the AI shuts down fairly quickly after being exposed to someone trying to shut it down, so it doesn't seem to me that it would get very far out of distribution. But I do agree that I made an error in underestimating the OOD argument before, and I need to think further about it.
I think my initial approach would probably be: the stop button problem doesn't just involve getting the AI to follow people's instructions without manipulating them, but also dynamically updating this behavior over time in response to people, dealing with an exponentially big space of possible behaviors. It is of course important to be able to deal with an exponentially big space of possible input behaviors, but this is not the problem my causal solution is designed to address; it's outside the scope of the plan. I can try to hack it, as I have done, and I think because the appropriate behavior in response to the stop button is quite simple (shut down ASAP), it is quite hackable, but really this isn't what it's supposed to address. So I'm tempted to find a simpler problem for counterfactual-based alignment.
As before, I still think the causal approach will be involved in most other parts of alignment, in a relatively similar way to what I wrote in the OP (utility functions containing lots of counterfactuals over people's preferences, to make them sensitive to people's preferences rather than wanting to manipulate them or similar). However, a non-hacky approach would, even for something as simple as the stop button, also include some other components. (Which I think I've acknowledged from the start; I never claimed to have a perfect solution to the stop button problem. But I think I hadn't properly considered the problem of exponentially big input spaces, which seems to require a separate solution.)
Roughly yes. (I would pick different distributions, but yes.)
I find it sort of hard to answer this question because I immediately end up back on the flaws I already mentioned in the OP. I'm also not sure whether or not you're including the OOD arguments here. I'll have to return to this tomorrow as it's late and I'm tired and need to go to bed.