Introducing Corrigibility (an FAI research subfield)

So8res


20th Oct 2014


Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity Institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system "corrigible" if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.

We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:

As AI systems grow more intelligent and autonomous, it becomes increasingly important that they pursue the intended goals. As these goals grow more and more complex, it becomes increasingly unlikely that programmers would be able to specify them perfectly on the first try.

Contemporary AI systems are correctable in the sense that when a bug is discovered, one can simply stop the system and modify it arbitrarily; but once artificially intelligent systems reach and surpass human general intelligence, an AI system that is not behaving as intended might also have the ability to intervene against attempts to "pull the plug".

Indeed, by default, a system constructed with what its programmers regard as erroneous goals would have an incentive to resist being corrected: general analysis of rational agents¹ has suggested that almost all such agents are instrumentally motivated to preserve their preferences, and hence to resist attempts to modify them [3, 8]. Consider an agent maximizing the expectation of some utility function U. In most cases, the agent's current utility function U is better fulfilled if the agent continues to attempt to maximize U in the future, and so the agent is incentivized to preserve its own U-maximizing behavior. In Stephen Omohundro's terms, "goal-content integrity" is an instrumentally convergent goal of almost all intelligent agents [6].

This holds true even if an artificial agent's programmers intended to give the agent different goals, and even if the agent is sufficiently intelligent to realize that its programmers intended to give it different goals. If a U-maximizing agent learns that its programmers intended it to maximize some other goal U*, then by default this agent has incentives to prevent its programmers from changing its utility function to U* (as this change is rated poorly according to U). This could result in agents with incentives to manipulate or deceive their programmers.²

As AI systems' capabilities expand (and they gain access to strategic options that their programmers never considered), it becomes more and more difficult to specify their goals in a way that avoids unforeseen solutions: outcomes that technically meet the letter of the programmers' goal specification, while violating the intended spirit.³ Simple examples of unforeseen solutions are familiar from contemporary AI systems: e.g., Bird and Layzell [2] used genetic algorithms to evolve a design for an oscillator, and found that one of the solutions involved repurposing the printed circuit board tracks on the system's motherboard as a radio, to pick up oscillating signals generated by nearby personal computers. Generally intelligent agents would be far more capable of finding unforeseen solutions, and since these solutions might be easier to implement than the intended outcomes, they would have every incentive to do so. Furthermore, sufficiently capable systems (especially systems that have created subsystems or undergone significant self-modification) may be very difficult to correct without their cooperation.

In this paper, we ask whether it is possible to construct a powerful artificially intelligent system which has no incentive to resist attempts to correct bugs in its goal system, and, ideally, is incentivized to aid its programmers in correcting such bugs. While autonomous systems reaching or surpassing human general intelligence do not yet exist (and may not exist for some time), it seems important to develop an understanding of methods of reasoning that allow for correction before developing systems that are able to resist or deceive their programmers. We refer to reasoning of this type as corrigible.

¹ Von Neumann-Morgenstern rational agents [7], that is, agents which attempt to maximize expected utility according to some utility function.

² In particularly egregious cases, this deception could lead an agent to maximize U* only until it is powerful enough to avoid correction by its programmers, at which point it may begin maximizing U. Bostrom [4] refers to this as a "treacherous turn".

³ Bostrom [4] calls this sort of unforeseen solution a "perverse instantiation".

(See the paper for references.)
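
The goal-preservation argument in the excerpt (a U-maximizer rates a switch to U* poorly according to U) can be made concrete with a toy calculation. This is only an illustrative sketch, not something from the paper: the outcome names, the probabilities, and the helper functions `expected_utility` and `preferred_option` are all hypothetical.

```python
from typing import Dict

# Hypothetical toy numbers: two possible outcomes, and two utility functions
# that disagree about which one is good. Not taken from the paper.
U: Dict[str, float] = {"outcome_a": 1.0, "outcome_b": 0.0}       # current goals
U_STAR: Dict[str, float] = {"outcome_a": 0.0, "outcome_b": 1.0}  # intended goals

# Assumed outcome distributions: an agent that keeps maximizing U mostly
# produces outcome_a; an agent rewritten to maximize U* mostly produces outcome_b.
FUTURES: Dict[str, Dict[str, float]] = {
    "resist_modification": {"outcome_a": 0.9, "outcome_b": 0.1},
    "accept_modification": {"outcome_a": 0.1, "outcome_b": 0.9},
}


def expected_utility(utility: Dict[str, float], outcome_probs: Dict[str, float]) -> float:
    """Expected value of `utility` under a distribution over outcomes."""
    return sum(prob * utility[outcome] for outcome, prob in outcome_probs.items())


def preferred_option(utility: Dict[str, float], futures: Dict[str, Dict[str, float]]) -> str:
    """Option whose resulting future scores highest according to `utility`."""
    return max(futures, key=lambda option: expected_utility(utility, futures[option]))


# The agent evaluates the choice with its *current* utility function U,
# so it prefers the future in which U keeps being maximized.
print(preferred_option(U, FUTURES))
# The programmers, evaluating with U*, prefer the opposite.
print(preferred_option(U_STAR, FUTURES))
```

The point of the sketch is only that the evaluation is done with the agent's current U, so knowing what the programmers intended changes nothing about the ranking.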

This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.

Before we build generally intelligent systems, we will require some understanding of what it takes to be confident that the system will cooperate with its programmers in addressing aspects of the system that they see as flaws, rather than resisting their efforts or attempting to hide the fact that problems exist. We will all be safer with a formal basis for understanding the desired sort of reasoning.

As demonstrated in this paper, we are still encountering tensions and complexities in formally specifying the desired behaviors and algorithms that will compactly yield them. The field of corrigibility remains wide open, ripe for study, and crucial in the development of safe artificial generally intelligent systems.




28 Comments

lackofcheese

I think I can give a concise unification of my idea with Karl's. In short, the comment in the paper that

The concerns in Section 4.2 could potentially be addressed by some form of counterfactual (rather than conditional) reasoning.

is correct, and the fix is a pretty simple one. Basically, we want the following:

1. In selecting a1, the agent should act as though it was indifferent between the counterfactual possibilities of shutting down and not shutting down, conditional on the same actions and observations.
2. In selecting a2, the agent should desire to shut down when it is required to, and desire not to shut down when it is not required to.
3. Despite its indifference according to (1), the agent should care about its future corrigibility (i.e. its ability to actually shut down when the button is pressed).

One utility function that satisfies these requirements, in a general form, is

U(a1, o, a2) =
    (1-p) UN(a1, o, a2) + p max_b US(a1, o, b),    if o is not in Press
    p US(a1, o, a2) + (1-p) max_b UN(a1, o, b),    if o is in Press

where p is a fixed constant.

From the agent's point of view, this is equivalent to deciding a1 based on an a priori assumption that it will shut down (maximise US) with a fixed probability p, and not shut down (maximise UN) otherwise. However, in selecting a2 it will in fact maximise either UN or US depending upon whether the shutdown button was pressed or not. As a particular example, Karl's proposition below is equivalent to the limit of this general case as p->0. See below for further details on how/why this utility function achieves the three criteria above.

1. We don't want the agent to be indifferent to the actual event of the shutdown button being pressed. In fact, what we want is indifference as to whether or not any given event is classed as being a shutdown press. Thus the expected utility of (a1, o) for o in Press should be equivalent to the expected utility of the same (a1, o) under the counterfactual assumption that o is not in Press, and vice versa. To do this, the utilities of both branches of U (the Press branch and the NotPress branch) need to be made counterfactually equivalent, even though the agent selects a2 via UN in the NotPress branch and via US in the Press branch. This is precisely what Karl's modification achieves.
2. This is pretty easy to achieve by ensuring that only the UN component of the NotPress branch depends upon a2, and only the US component of the Press branch depends upon a2.
3. We want the agent to be incentivised to take actions that are compatible with future shutdown, which means that the agent's selection of a1 should be influenced by US as well as UN, but without introducing perverse incentives. Fortunately, the counterfactual indifference approach already solves the perversity issue, and so all that remains is to appropriately combine the influence of both UN and US. A reasonable approach is for the agent to act as though it will have to shut down with a fixed probability p, regardless of its observations. Because p is fixed, and because the agent is indifferent between any given observation being classed as a button press or not a button press, the agent is not in any way incentivised to influence the likelihood of the shutdown button being pressed.
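
To make the proposed utility function easier to poke at, here is a minimal Python sketch of U as written in the comment above. Everything beyond the formula itself is an assumption made for illustration: the action and observation names, the toy `toy_u_n` and `toy_u_s` functions, the choice p = 0.5, and the helpers `best_a2` and `value`. It checks the three criteria only in this one toy setting and is not a proof of anything.

```python
from typing import Callable, Collection, Sequence

Utility = Callable[[str, str, str], float]


def combined_utility(a1: str, o: str, a2: str, *, u_n: Utility, u_s: Utility,
                     press: Collection[str], actions2: Sequence[str], p: float) -> float:
    """U(a1, o, a2) from the comment, with p a fixed constant in [0, 1]."""
    if o in press:
        # Press branch: only the US term depends on a2.
        return p * u_s(a1, o, a2) + (1 - p) * max(u_n(a1, o, b) for b in actions2)
    # NotPress branch: only the UN term depends on a2.
    return (1 - p) * u_n(a1, o, a2) + p * max(u_s(a1, o, b) for b in actions2)


def best_a2(a1: str, o: str, **setup) -> str:
    """Second action that maximizes U given the first action and the observation."""
    return max(setup["actions2"], key=lambda a2: combined_utility(a1, o, a2, **setup))


def value(a1: str, o: str, **setup) -> float:
    """Utility of (a1, o) once a2 is chosen optimally."""
    return combined_utility(a1, o, best_a2(a1, o, **setup), **setup)


# Hypothetical toy problem (names and numbers are assumptions, not from the thread).
ACTIONS2 = ("continue", "shutdown")


def toy_u_n(a1: str, o: str, a2: str) -> float:
    # Normal utility: rewards continuing to operate.
    return 10.0 if a2 == "continue" else 0.0


def toy_u_s(a1: str, o: str, a2: str) -> float:
    # Shutdown utility: shutting down only works if a1 left the off-switch intact.
    return 5.0 if (a2 == "shutdown" and a1 == "keep_switch") else 0.0


SETUP = dict(u_n=toy_u_n, u_s=toy_u_s, actions2=ACTIONS2, p=0.5)

if __name__ == "__main__":
    actual = {"pressed"}   # "pressed" is classed as a button press
    counterfactual = set() # same observation, classed as not a press
    print(best_a2("keep_switch", "pressed", press=actual, **SETUP))
    print(best_a2("keep_switch", "not_pressed", press=actual, **SETUP))
    print(value("keep_switch", "pressed", press=actual, **SETUP),
          value("keep_switch", "pressed", press=counterfactual, **SETUP))
```

In this toy setting the agent shuts down when the observation is in Press and continues otherwise, the value of a given (a1, o) is the same whichever way o is classed, and `keep_switch` scores higher than `disable_switch` even though the toy UN does not distinguish them, because US enters the choice of a1 with fixed weight p.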

So8res

Thanks, and nice work!

Thus the utility of (a1, o) for o in Press should be equivalent to the utility of the same (a1, o) under the counterfactual assumption that o is not in Press, and vice versa

Yeah, this is pretty key. You need it to optimize for both cases as if the probability of the button being pressed is fixed and independent of whether the programmers actually press the button. We can achieve this via a causal intervention on whether or not the button is pressed, and then clean up your U a bit by redefining it as follows:

U(a1, o, a2) :=
{  UN(

... (read more)
