Forum Digest: Corrigibility, utility indifference, & related control ideas

Benya_Fallenstein

This is a quick recap of the posts on this forum that deal with corrigibility (making sure that if you get an agent's goal system wrong, it doesn't try to prevent you from changing it), utility indifference (removing the agent's incentive to manipulate you into changing or not changing its goal system, by adding rewards to its utility function so that it gets the same utility in both cases), and related AI control ideas. It's current as of 3/21/15.

Papers

As background to the posts listed below, the following two papers may be helpful.

Corrigibility, by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong (2015). This paper introduces the problem of corrigibility and analyzes some simple models, including a version of Stuart Armstrong's utility indifference. Abstract:

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.
Utility Indifference, by Stuart Armstrong (2010). An older paper by Stuart explaining the utility indifference approach.

Corrigibility

Generalizing the Corrigibility paper's impossibility result?, Benja Fallenstein. The Corrigibility paper looks at a particular linear way to combine two utility functions, $U_{N}$ and $U_{S}$ , which incentivize normal operation and shutdown, respectively. It shows that most such linear combinations lead to unintended behavior. Is it possible to avoid this problem by considering non-linear combinations? It turns out that this question isn't quite well-formed.

Utility indifference

Utility indifference and infinite improbability drives, Benja Fallenstein. (The Corrigibility paper doesn't exactly reflect Stuart's approach, and Stuart's approach avoids the exact problem stated in the Corrigibility paper, but it can still be interpreted as shifting the agent's probability distribution, and this still makes the agent do stupid things.)
Un-manipulable counterfactuals, Stuart Armstrong. The Corrigibility paper uses causal counterfactuals, à la Pearl. In this post, Stuart suggests defining counterfactuals by conditioning on a chaotic random event the AI can't influence. For example, we might make it so that an oracle has a low probability of producing no output, and a high probability of outputting its prediction of what would have happened conditional on it not outputting anything.
Orthogonality: action counterfactuals, Stuart Armstrong. Suggests a version of utility indifference where a shutdown button does not change the agent's utility function directly, but permits the agent to execute an action that changes its utility function. Additionally suggests computing the utility of this action in a similar way as in other versions of utility indifference, but with a small additional term $ϵ$ rewarding a change in utility. Argues that this incentivizes the agent to manipulate its operators to press the shutdown button, but only if this action is extremely cheap.
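
As a rough illustration of the basic utility indifference idea described at the top of this digest (adding a reward so that the agent gets the same utility whether or not its goal system is changed), here is a minimal Python sketch. It is a toy model made up for this digest, not the construction from any of the papers or posts: the function names, the single button-press probability, and the numbers are all assumptions.

```python
def expected_utility(p_press, value_normal, value_shutdown, bonus=0.0):
    """Expected utility of a toy agent facing a shutdown button.

    p_press        -- probability that the button ends up pressed (hypothetical)
    value_normal   -- expected utility from continuing normal operation
    value_shutdown -- expected utility from shutting down
    bonus          -- compensating reward paid only if the button is pressed
    """
    return (1 - p_press) * value_normal + p_press * (value_shutdown + bonus)


def indifference_bonus(value_normal, value_shutdown):
    """Reward that makes the pressed and unpressed cases equally valuable."""
    return value_normal - value_shutdown


if __name__ == "__main__":
    value_normal, value_shutdown = 10.0, 3.0  # illustrative numbers only
    bonus = indifference_bonus(value_normal, value_shutdown)
    for p_press in (0.1, 0.5, 0.9):
        plain = expected_utility(p_press, value_normal, value_shutdown)
        compensated = expected_utility(p_press, value_normal, value_shutdown, bonus)
        print(p_press, plain, compensated)
```

Without the bonus, the toy agent's expected utility falls as the press probability rises, so it would pay to make a press less likely. With the bonus, expected utility no longer depends on the press probability, so there is nothing to gain from influencing the button in either direction. The posts in this section discuss why real proposals are subtler than this.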

Safe oracles

Predictors that don't try to manipulate you(?), Benja Fallenstein. If you implement an agent whose only goal is to output correct predictions about future events, this agent may still have an incentive to manipulate the environment to make it easier to predict. This post suggests a potential way to define an agent which wants to make correct predictions but does not want to make its environment easier to predict.
Non-manipulative oracles, Stuart Armstrong. Suggests avoiding manipulation by a predictor by having the oracle predict not what will happen in the actual world, but what would happen in a counterfactual world where the oracle didn't produce any output.

Manipulating an agent's beliefs

False Thermodynamic Miracles and False Thermodynamic Miracles, in equation form, Stuart Armstrong. Considers ways to make an agent act as if it believes with probability $\approx 1$ that a certain event will happen which in reality should be assigned a probability $\approx 0$ .
Safe probability manipulation, superweapons, and stable self-improvement research, Stuart Armstrong. Suggests that the existence, for every decision theory, of an "evil" decision problem on which that decision theory fails can be used to force an agent to be uncertain about events we are very certain about, even if the agent is much smarter than us.

Low-impact agents

AI-created pseudo-deontology, Stuart Armstrong. Proposes implementing an agent $A$ whose only task is to create an agent $B$ , whose utility function will be modified by some noise before $B$ is run. Argues that this may lead $A$ to create a $B$ which "follows its motivation to some extent, but not to extreme amounts", because $A$ wants $B$ 's behavior to be robust to this noise.
Restrictions that are hard to hack, Stuart Armstrong. Putting specific restrictions on an agent's motivation is problematic as a safety technique, because the agent will usually be able to find unintended instantiations that satisfy the letter but not the spirit of the restriction. This post suggests that unintended instantiations are more informative about small changes in the restrictions they instantiate than intended instantiations, and suggests using this to make unintended instantiations less likely.
Creating a satisficer, Stuart Armstrong. Proposes a potential way of creating an agent that tries to act in a way that does well on one utility function $u$ while trying to have little impact on many other utility functions $v$ .

Odds and ends

Resource gathering agent, Stuart Armstrong. Argues that if we take an arbitrary utility function $u$ , and build an agent that assigns 50% probability that it wants to maximize $u$ and 50% probability that it wants to maximize $- u$ (it will find out the truth tomorrow), then we get an agent that is purely interested in convergent instrumental goals like resource gathering. Suggests that if we could somehow "subtract off" such an agent from another agent, we could construct an agent that doesn't try to follow convergent instrumental goals.
Acausal trade barriers, Stuart Armstrong. Suggests a technique similar to utility indifference which may disincentivize agents from acausally trading with each other.
Anti-Pascaline agents, Stuart Armstrong. Given a random variable $X$ , an event $A$ , and a small $ε > 0$ , define $\overline{p}_{ε} (X ∣ A)$ to be such that conditional on $A$ , the probability of $X \geq \overline{p}_{ε} (X ∣ A)$ is $ε$ . Similarly define $\underline{p}_{ε} (X ∣ A)$ by replacing $\geq$ by $\leq$ . Then define $E_{ε} [X ∣ A] := E [X^{'} ∣ A]$ , where $X^{'}$ is $X$ bounded to $[\underline{p}_{ε} (X ∣ A), \overline{p}_{ε} (X ∣ A)]$ . Given a utility function $u$ , this post proposes taking the action $a$ that maximizes $E_{ε} [u ∣ a]$ as an "unprincipled" approach to dealing with Pascal's Mugging.
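
The clipped expectation $E_{ε}$ in the Anti-Pascaline entry can be sketched for a finite outcome distribution as follows. This is only an illustration under stated assumptions: the representation of a distribution as a list of `(value, probability)` pairs, the quantile convention used for discrete distributions (where no threshold may have tail probability exactly $ε$), the helper names, and the mugging numbers are all made up here and are not taken from the post.

```python
def quantile(dist, q):
    """Smallest value v with P(X <= v) >= q, for a list of (value, prob) pairs."""
    total = 0.0
    for value, prob in sorted(dist):
        total += prob
        if total >= q - 1e-12:  # small tolerance for floating-point round-off
            return value
    return max(value for value, _ in dist)


def plain_expectation(dist):
    """Ordinary expected value of the distribution."""
    return sum(prob * value for value, prob in dist)


def clipped_expectation(dist, eps):
    """Expectation of X after bounding it to [eps-quantile, (1 - eps)-quantile]."""
    lower = quantile(dist, eps)
    upper = quantile(dist, 1.0 - eps)
    return sum(prob * min(max(value, lower), upper) for value, prob in dist)


if __name__ == "__main__":
    # Hypothetical mugging: paying costs 5 and has a tiny chance of a huge payoff.
    pay = [(-5.0, 1 - 1e-9), (1e15 - 5.0, 1e-9)]
    refuse = [(0.0, 1.0)]
    for name, dist in (("pay", pay), ("refuse", refuse)):
        print(name, plain_expectation(dist), clipped_expectation(dist, 0.01))
```

In this toy setup the ordinary expectation favors paying, because the tiny-probability payoff dominates the sum. Once outcomes in the top and bottom $ε$ tails are clipped, the huge payoff no longer contributes and refusing scores higher.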
