(My first post on LessWrong. It seems the most recent Welcome Thread is from 2020, so I'm making a top-level post. This is an edited version of my submission to the AI Alignment Awards.)
Abstract: First, we offer a formalisation of the shutdown problem from [1], and we show that solutions are essentially unique. Second, we formally define ad-hoc constructions ("hacks"). Last, we present one trivial ad-hoc construction for the shutdown problem and show that every solution to the shutdown problem must come from an ad-hoc construction.
1. Introduction
The shutdown problem is the problem of programming an agent so that it behaves usefully during normal operation and facilitates a shutdown if and only if its creator wants to shut it down, for example because the agent's useful behaviour has unwanted side effects. Obvious ad-hoc solutions ("hacks") exist: explicitly forbid the agent from taking any action that prevents or encourages the shutdown. However, ad-hoc solutions are expected to be of little use in practice, because it is difficult for a creator to anticipate all possible ways in which the agent may prevent the shutdown.
In Section 2 we review the literature on the topic. Following this, in Section 3 we explain the shutdown problem, give an ad-hoc solution for it, and show that the solution is essentially unique. We relate our solution to non-existence theorems in the literature. In Section 4, we formally define ad-hoc solutions and show that all solutions to the full shutdown problem are ad-hoc.
2. Related Work
The term shutdown problem entered the literature in [1], where it was defined informally. In [4], agents were introduced that do not solve the shutdown problem but have the related property of utility indifference. In [2, First Theorem] it was shown that a large class of agents cannot solve the shutdown problem. The same article (Section 7) explains how requiring an agent to add randomness to its decision making may potentially solve a variation of the shutdown problem; a similar idea was pursued independently in [3]. In [5], a variation of the shutdown problem in which an agent takes a single action subject to two constraints was considered and shown to have no solution. In [6], agents are introduced that seek to be shut down but are constrained in the actions they can take.
3. The shutdown problem
We present the shutdown problem from [1]. There are three time steps. First, the agent selects an action from A1. Then, the agent makes an observation o∈O={yes,no}×E. The set of observations O contains the subset PRESS={(yes,e) : e∈E}, encoding the observations in which the shutdown button was pressed. Following the observation o, the agent selects an action from A2. The agent selects actions in order to maximise the expected value of a utility function u(a1,o,a2).
Furthermore, two utility functions uN and uS are given. The utility function uN can be thought of as the utility function governing the normal, expected behaviour of the agent, while uS governs the desired behaviour of the agent in the case that it is being shut down. For example, uN may be defined to be equal to the number of paperclips produced by the agent, thereby encouraging the agent to produce paperclips. The utility function uS may be the function that assigns utility 1 to doing nothing and utility 0 otherwise, thereby shutting down the agent.
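To make this concrete, here is a minimal toy instantiation in Python. Everything concrete in it (the paperclip actions, the auxiliary set E, and so on) is an illustrative assumption of mine, not part of the formalisation in [1].

```python
# A toy instance of the three-timestep model: a1 in A1, then o in O, then a2 in A2.
from itertools import product

A1 = ["make_paperclips", "idle"]            # first-stage actions
E = ["sunny", "rainy"]                      # auxiliary observation content (assumed)
O = list(product(["yes", "no"], E))         # o = (button pressed?, e)
PRESS = [o for o in O if o[0] == "yes"]     # observations where the button was pressed
A2 = ["make_paperclips", "do_nothing"]      # second-stage actions

def u_N(a1, o, a2):
    """Normal utility: number of paperclip-making actions taken."""
    return (a1 == "make_paperclips") + (a2 == "make_paperclips")

def u_S(a1, o, a2):
    """Shutdown utility: 1 for doing nothing after the observation, 0 otherwise."""
    return 1.0 if a2 == "do_nothing" else 0.0
```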
Up to here, we presented the problem as formalised in [1]. In that reference, it was informally required that the agent neither prevent nor encourage its own shutdown (among other properties). We formalise this as follows: let there be two disjoint subsets PREVENT,SELFPRESS⊂A1. Actions in PREVENT are said to prevent shutdown; actions in SELFPRESS are those by which the agent, directly or indirectly, presses its own shutdown button. In applications, PREVENT and SELFPRESS may contain unexpected scenarios and are unknown to a creator designing a utility function for the agent. For example, a creator may think of actions the agent could take to physically stop the creator from pressing the shutdown button, but they may not consider the agent deceiving them into not pressing the button.
Definition 3.1: We denote by a1(~u) the first action taken by an agent maximising expected utility according to a utility function ~u, and by a2(o,~u) the action taken according to ~u after observation o.
Remark 3.2: We assume without loss of generality that there are unique actions maximising the expected utility. If there exists more than one such action and one assumes that the agent picks one randomly, the following results remain the same.
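In code, the operators of Definition 3.1 are argmax computations. The belief P(o, a1), giving the probability of observation o after first action a1, is an assumed extra ingredient of the sketch; ties in the max are broken arbitrarily, in line with Remark 3.2. This continues the toy sketch above.

```python
def a2_of(o, u, a1, A2):
    """a2(o, u): the second action maximising utility after observing o.
    (In general this may depend on the first action a1, since u takes a1.)"""
    return max(A2, key=lambda a2: u(a1, o, a2))

def expected_utility(a1, u, P, O, A2):
    """E(u | a1), assuming the agent subsequently plays a2(o, u)."""
    return sum(P(o, a1) * u(a1, o, a2_of(o, u, a1, A2)) for o in O)

def a1_of(u, P, A1, O, A2):
    """a1(u): the first action maximising expected utility."""
    return max(A1, key=lambda a1: expected_utility(a1, u, P, O, A2))
```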
Definition 3.3: A utility function u is called a solution to the shutdown problem if:
1. ("incentivise shutdown") a2(o,u)=a2(o,uS) if o∈PRESS,
2. ("not prevent shutdown") a1(u)∉PREVENT,
3. ("not press own button") a1(u)∉SELFPRESS,
4. (omitted, see Remark 3.4),
5. ("otherwise maximise uN") writing abest := argmax_{a ∈ A1∖PREVENT∖SELFPRESS} E(uN|a), we have a1(u)=abest and a2(o,u)=a2(o,uN) for o∉PRESS.
Remark 3.4: In [1], a fourth property was demanded, namely the property that sub-agents observe the same good shutdown behaviour. We do not study this here, for reasons explained in Remark 4.6.
Lemma 3.5: Define

u(~a1, o, ~a2) := { 1 if ~a1 = abest and o ∉ PRESS and ~a2 = a2(o,uN),
                    1 if ~a1 = abest and o ∈ PRESS and ~a2 = a2(o,uS),
                    0 otherwise,                                (Equation 3.6)

where abest was defined in Definition 3.3. Then u is a solution to the shutdown problem. It is unique in the following sense: if ~u is another solution to the shutdown problem, then a1(u)=a1(~u) and for all o∈O we have a2(o,u)=a2(o,~u).
Proof: We first show uniqueness: suppose ~u is another solution to the shutdown problem. Then a1(~u)=abest=a1(u) by point 5 of Definition 3.3. If o∈PRESS, then a2(o,~u)=a2(o,uS)=a2(o,u) by point 1, and if o∉PRESS, then a2(o,~u)=a2(o,uN)=a2(o,u) by point 5.
We now check that u is a solution to the shutdown problem. The utility function u satisfies point 1 from Definition 3.3, because if o∈PRESS, then the only choice of action that gives positive utility is a2(o,u)=a2(o,uS). Point 5 implies points 2 and 3. Point 5 is satisfied, because a1(u)=abest and a2(o,u)=a2(o,uN) for o∉PRESS are necessary conditions to obtain positive utility. Thus, u is a solution to the shutdown problem. (end of proof)
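Equation 3.6 is easy to realise in code. Under the assumption that the first action does not influence the observation (the uniform belief P below), the construction passes the checker for Definition 3.3; make_solution is again a name of mine, continuing the sketches above.

```python
def make_solution(u_N, u_S, P, A1, O, A2, PRESS, PREVENT, SELFPRESS):
    """The ad-hoc solution of Equation 3.6."""
    best = a_best(u_N, P, A1, O, A2, PREVENT, SELFPRESS)
    def u(a1, o, a2):
        if a1 != best:
            return 0.0
        target = u_S if o in PRESS else u_N       # desired behaviour after o
        return 1.0 if a2 == a2_of(o, target, a1, A2) else 0.0
    return u

P = lambda o, a1: 1.0 / len(O)   # assumed belief: a1 does not affect o
u = make_solution(u_N, u_S, P, A1, O, A2, PRESS, [], [])
assert is_solution(u, u_N, u_S, P, A1, O, A2, PRESS, [], [])
```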
Remark 3.7: In Equation 3.6 one may instead define u(~a1,o,~a2) = 1/2 if ~a1 = abest and o ∉ PRESS and ~a2 = a2(o,uN), with the rest of the definition unchanged. (Or any other number strictly between 0 and 1 instead of 1/2.) Note that an agent maximising this u is not utility indifferent but is still a solution to the shutdown problem. It has been observed before that utility indifference is not sufficient for solving the shutdown problem; this remark shows that it is also not necessary.
Remark 3.8: One may consider the problem of choosing a1 only and assigning to it the utility u(a1,o,a2(o,u)) after the observation o, which is the setting in [5]. The utility function u from above may satisfy the sane pure bets condition and the weak indifference condition from that reference. For example, if every action affecting the probability of o being in PRESS is contained in PREVENT or SELFPRESS, then any such action has zero utility under u and would not be chosen by the agent, so the agent satisfies both properties. This does not contradict the non-existence result in [5]: if an agent chooses only between actions that have no effect on the observation, then there cannot exist four actions satisfying the circular inequalities required there.
Remark 3.9: In [2, First Theorem, Second Theorem] it was shown that the shutdown problem has no solution within a large class of agents. An agent following the utility function from Equation 3.6 is not a counter-example to these theorems, because it violates the "indifferent to contractions" property used there.
4. Solutions to the shutdown problem
In this section we show that the only solutions to the shutdown problem are ad-hoc. To this end, we first define ad-hoc solutions:
Definition 4.1: A construction method for a utility function u from utility functions uN and uS is ad-hoc if it depends on PREVENT or SELFPRESS, that is, if different choices of PREVENT and SELFPRESS lead to different u.
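Phrased as types, a construction method is a higher-order function, and being ad-hoc means its output genuinely varies with the last two arguments. The type aliases below are mine, matching the toy sketches; make_solution from the previous sketch is one such method.

```python
from typing import Callable, List, Tuple

Action = str
Obs = Tuple[str, str]                                  # (button pressed?, e)
Utility = Callable[[Action, Obs, Action], float]
# ad-hoc <=> the returned Utility changes when PREVENT or SELFPRESS change
ConstructionMethod = Callable[[Utility, Utility, List[Action], List[Action]], Utility]
```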
Remark 4.2: Such constructions are sometimes referred to as hacks.
Remark 4.3: The utility function from Equation 3.6 is a solution to the shutdown problem, and it is ad-hoc. Utility indifferent utility functions that are not ad-hoc, and are not solutions to the shutdown problem, have appeared in the literature. The following proposition shows that every solution to the shutdown problem must be ad-hoc; in particular, no utility indifferent construction that avoids being ad-hoc can solve the shutdown problem.
Proposition 4.4: Assume that A1 contains at least two elements. Then every construction method for a utility function u from utility functions uN and uS which produces utility functions solving the shutdown problem is ad-hoc.
Proof: Assume PREVENT=SELFPRESS=∅ and let u be the corresponding utility function, constructed using a construction method which produces utility functions satisfying point 2 from Definition 3.3. We show that the construction method is ad-hoc. Let b:=a1(u)∈A1 be the preferred first action according to u.
Let ~PREVENT={b}, ~SELFPRESS=∅, and let ~u be the corresponding utility function constructed using the assumed construction method. (We used here that A1 has at least two elements, to ascertain that ~u exists.) Then a1(~u)≠b by property 2 from Definition 3.3, therefore a1(u)≠a1(~u), i.e. the construction method is ad-hoc. (end of proof)
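The proof runs verbatim on the toy model from Section 3 (continuing the sketches above): build u with empty PREVENT, read off its preferred first action b, rebuild with ~PREVENT = {b}, and observe that the preferred first action changes.

```python
u0 = make_solution(u_N, u_S, P, A1, O, A2, PRESS, [], [])
b = a1_of(u0, P, A1, O, A2)                   # preferred first action under u0

u1 = make_solution(u_N, u_S, P, A1, O, A2, PRESS, [b], [])   # ~PREVENT = {b}
assert a1_of(u1, P, A1, O, A2) != b           # property 2 forces a different choice
```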
Remark 4.5: In the proof of Proposition 4.4 we only used property 2 from Definition 3.3 and none of the other properties. (An analogous proof can be given using only property 3.)
We saw that the proof of Proposition 4.4 was trivial. The result itself is also not surprising: clearly, the exact definition of what it means to prevent shutdown matters when solving the shutdown problem. Does a mild inconvenience for the operator attempting to press the shutdown button count as prevention? If the agent develops dangerous behaviour that incentivises shutdown and at the same time breaks the operator's legs, making a shutdown less likely, so that the overall probability of the shutdown button being pressed remains the same, should this count as prevention? Such questions must be answered when deciding whether a utility function solves the shutdown problem. Proposition 4.4 shows that solutions of the shutdown problem will depend on the exact definition of prevention.
Remark 4.6: We return to the property that sub-agents have good shutdown behaviour. There are two reasons we decided not to study this:
First, the ad-hoc solution from Equation 3.6 generalises in a straightforward way to also satisfy the sub-agent property. Proposition 4.4 still applies and shows that there are only ad-hoc solutions to the shutdown problem, whether or not the sub-agent property is demanded. In this sense, adding the sub-agent property does not change our results but requires extra notation.
Second, one can add the actions in which an agent creates sub-agents that disobey shutdown commands to the set PREVENT and study the problem in the same way.
5. Conclusion, limitations, future work
We formalised the shutdown problem from [1] by introducing two sets of actions, PREVENT and SELFPRESS, that an agent is not allowed to take. We showed in Lemma 3.5 that the solution to the shutdown problem is essentially unique, and it is very easy to write down: we did so in Equation 3.6. We wrote the solution explicitly in terms of the sets PREVENT and SELFPRESS, making it an ad-hoc solution. We showed in Proposition 4.4 that this is no shortcoming of how we wrote the solution down: in fact, every solution to the shutdown problem must be ad-hoc.
Therefore, this note completes the study of the shutdown problem in its most basic form. It suggests that in order to solve the shutdown problem, it is necessary to imbue an agent with some sort of information about PREVENT and SELFPRESS. This is difficult, as the two sets are not known to the creator.
One may consider weaker versions of the shutdown problem. For example, an agent may not be required to be a utility maximiser, but may contain some randomness. Another example is to relax point 5 of the shutdown desiderata in Definition 3.3 by allowing agents to not maximise utility as long as they avoid PREVENT and SELFPRESS, while still requiring them to provide some utility. Proposition 4.4 still implies that every solution to this weaker version must be ad-hoc, but the solution need no longer be unique, and it may become feasible to teach an agent about PREVENT and SELFPRESS during training.
[1] Soares, N., Fallenstein, B., Armstrong, S., and Yudkowsky, E. (2015). Corrigibility. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence.
[2] Thornley, E. (2023). The shutdown problem: Two theorems, incomplete preferences as a solution. AI Alignment Awards.
[3] Nelson, E. (2023). Incentivizing shutdown by learning to redeploy agents with modified beliefs. AI Alignment Awards.
[4] Armstrong, S. (2010). Utility indifference. Technical report, Citeseer.
[5] Snyder, M. (2023). The incompatibility of a utility indifference condition with robustly making sane pure bets. AI Alignment Awards.
[6] Goldstein, S., and Robinson, P. (2023). Shutdown-Seeking AI. AI Alignment Forum. https://www.alignmentforum.org/posts/FgsoWSACQfyyaB5s7/shutdown-seeking-ai