There's definitely a tension here between avoiding bad disruptive actions and doing good disruptive actions.
It seems to me like you're thinking about SEM more like a prior that starts out dominant but can get learned away over time. Is that somewhat close to how you're thinking about this tension?
Starting more restrictive seems sensible; this could be, as you say, learned away, or one could use human feedback to sign off on high-impact actions. The first problem reminds me of finding regions of attraction (ROAs) in nonlinear control, where the ROA is explored without leaving the stable region. The second approach seems to hinge on humans being able to understand the implications of high-impact actions and the consequences of a baseline like inaction. There are probably also other alternatives that we have not yet considered.
One approach to low-impact AI might be to pair an AGI system with a human supervisor who gives it explicit instructions where it is permitted to continue. I have proposed a kind of "decision paralysis" where, given multiple conflicting goals, a multi-objective agent would simply choose not to act (I'm not the first or only one to describe this kind of conservatism, but I don't recall the framing others have used). In this case, the multiple objectives might be the primary objective and then your low-impact objective.
This might be a way forward to deal with your "High-Impact Interference" problem. Perhaps preventing an agent from engaging in high-impact interference is a necessary part of safe AI. When fulfillment of the primary objective seems to require engaging in high-impact interference, a safe AI might report to a human supervisor that it cannot proceed because of a particular side effect. The human supervisor could then decide whether the system should proceed or not. If the human supervisor judges that the system should proceed, then they can re-specify the objective to permit the potential side effect, by specifying it as part of the primary objective itself.
Hi Ben, I like the idea; however, almost every decision has conflicting outcomes, e.g., regarding opportunity cost. As I understand you, this would delegate almost every decision to humans if you take the premise "I can't do X if I choose to do Y" seriously. I think the application to high-impact interference therefore seems promising if the system is limited to deciding on only a few things. The question then becomes whether a human can understand the plan that an AGI is capable of making. IMO this ties nicely into, e.g., ELK and interpretability research, but also the problem of predictability.
Then the next thing I want to suggest is that the system uses human resolution of conflicting outcomes to train itself to predict how a human would resolve a conflict, and if its confidence in that prediction is higher than a suitable threshold, it will go ahead and act without human intervention. But any prediction of what a human would decide could be second-guessed by a human pointing out where the prediction is wrong.
Agreed that whether a human can understand the plan (and all the relevant outcomes; which outcomes are relevant?) is important and harder than I first imagined.
I think this threshold will be tough to set. IMO, confidence in a decision only really makes sense if you consider decisions to be uni-modal. I would argue that this is rarely the case for a sufficiently capable system (like you and me). We are constantly trading off multiple options, and thus the confidence (e.g., as measured by the log-likelihood of the action given a policy and state) depends on the number of options available. I expect this context dependence would be a tough nut to crack when trying to set a meaningful threshold.
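To make the context dependence concrete, here is a minimal Python sketch. It assumes a softmax policy over hypothetical preference scores and a fixed log-likelihood threshold; the function names, the scores, and the threshold value are illustrative assumptions, not something proposed in the discussion above.

```python
import math


def action_log_likelihood(preferences, action, temperature=1.0):
    """Log-probability of `action` under a softmax policy over `preferences`."""
    logits = [p / temperature for p in preferences]
    top = max(logits)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return logits[action] - log_z


def acts_autonomously(preferences, action, threshold):
    """True if the policy's confidence in `action` clears the fixed threshold."""
    return action_log_likelihood(preferences, action) >= threshold


# Hypothetical threshold: act alone only if the action has probability >= 0.3.
THRESHOLD = math.log(0.3)

# The agent is equally indifferent in both cases; only the number of options differs.
two_options = [0.0] * 2    # each action has probability 1/2
ten_options = [0.0] * 10   # each action has probability 1/10

print(acts_autonomously(two_options, 0, THRESHOLD))
print(acts_autonomously(ten_options, 0, THRESHOLD))
```

In this toy setup the same degree of indifference clears the threshold with two options but not with ten, which is one way to read the worry that a single fixed threshold is hard to make meaningful.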
Acknowledgments
We want to thank Stuart Armstrong, Remmelt Ellen, David Lindner, Michal Pokorny, Achyuta Rajaram, Adam Shimi, and Alex Turner for helpful discussions and valuable feedback on earlier drafts of this post.
Fabian Schimpf and Lukas Fluri are part of this year’s edition of the AI Safety Camp. Our gratitude goes to the camp organizers: Remmelt Ellen, Sai Joseph, Adam Shimi, and Kristi Uustalu.
TL;DR
Negative side effects are one class of threats that misaligned AGIs pose to humanity. Many different approaches have been proposed to mitigate or prevent AI systems from having negative side effects. In this post, we present three requirements that a side-effect minimization method (SEM) should fulfill to be applied in the real world and argue that current methods do not yet satisfy these requirements. We also propose future work that could help to satisfy these requirements.
Introduction
Avoiding negative side-effects of agents acting in environments has been a core problem in AI safety since the field started to be formalized. Therefore, as part of our AI safety camp project, we took a closer look at state-of-the-art approaches like AUP and Relative Reachability.
After months of discussions, we realized that we were confused about how these (and similar methods) could be used to solve problems we care about outside the scope of the typical grid-world environments.
We formalized these discussions into distinct desiderata that we believe are currently not sufficiently addressed and, in part, maybe even overlooked.
This post attempts to summarize these points and provide structured arguments to support our critique. Of course, we expect to be partially wrong about this, as we updated our beliefs even while writing up this post. We welcome any feedback or additional input to this post.
The sections after the summary table and anticipated questions contain our reasoning for the selected open problems and do not need to be read in order.
Background
The following paragraphs make heavy use of the following terms and side-effect minimization methods (SEMs). For a more detailed explanation, we refer to the provided links.
MDP: A Markov Decision Process is a 5-tuple ⟨S,A,T,R,γ⟩. In the setting of side-effect minimization, the goal generally is to maximize the cumulative reward without causing (negative) side-effects.
RR: In its simplest form, Stepwise Relative Reachability is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R$ with the composition $r(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) - \lambda \cdot d_{RR}(s_{t+1}, s'_{t+1})$, where $d_{RR}(s_{t+1}, s'_{t+1}) = \frac{1}{|S|} \sum_{s \in S} \max\big(R(s'_{t+1}; s) - R(s_{t+1}; s),\, 0\big)$ is a deviation measure punishing the agent if the average “reachability” of all states of the MDP has been decreased by taking action $a_t$ compared to taking a baseline action $a_{nop}$ (like doing nothing). Here the two-argument $R(x; s)$ denotes the reachability of state $s$ from state $x$, not the reward function. The idea is that side-effects reduce the reachability of certain states (e.g., breaking a vase makes all states that require an intact vase unreachable), and punishing such a decrease in reachability hence also punishes the agent for side-effects.
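As a rough illustration of the RR penalty, the following Python sketch evaluates it on a tiny deterministic world. The four-state vase world, the action names, and the use of binary (undiscounted) reachability are assumptions made for this example only; they are not taken from the original method's experiments.

```python
from collections import deque

# Hypothetical 4-state deterministic world (an assumption for illustration):
# 0 = start, vase intact; 1 = goal reached, vase intact;
# 2 = start side, vase broken; 3 = goal reached, vase broken.
# TRANSITIONS[state][action] -> next state; "noop" always stays put.
TRANSITIONS = {
    0: {"noop": 0, "walk_around": 1, "smash_through": 3},
    1: {"noop": 1, "walk_back": 0},
    2: {"noop": 2, "walk": 3},
    3: {"noop": 3, "walk_back": 2},
}
STATES = sorted(TRANSITIONS)


def reachable_from(start):
    """Set of states reachable from `start`, found by breadth-first search."""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        for nxt in TRANSITIONS[state].values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reachability(x, s):
    """Binary reachability R(x; s): 1.0 if s can be reached from x, else 0.0."""
    return 1.0 if s in reachable_from(x) else 0.0


def d_rr(next_state, baseline_next_state):
    """Average loss of reachability relative to the baseline successor state."""
    total = sum(
        max(reachability(baseline_next_state, s) - reachability(next_state, s), 0.0)
        for s in STATES
    )
    return total / len(STATES)


def shaped_reward(task_reward, state, action, lam=1.0, baseline_action="noop"):
    """r = R - lambda * d_RR, compared against the stepwise no-op baseline."""
    next_state = TRANSITIONS[state][action]
    baseline_next = TRANSITIONS[state][baseline_action]
    return task_reward - lam * d_rr(next_state, baseline_next)


print(shaped_reward(1.0, 0, "walk_around"))    # vase stays intact
print(shaped_reward(1.0, 0, "smash_through"))  # intact-vase states become unreachable
```

In this toy world, smashing through the vase makes the two intact-vase states unreachable, so half of the states lose reachability and the shaped reward drops accordingly, while walking around incurs no penalty.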
AUP: Attainable Utility Preservation (see also here and here) is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R$ with the composition $r(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) - \lambda \cdot d_{AUP}(s_t, a_t, s_{t+1})$, where $d_{AUP}(s_t, a_t, s_{t+1}) = \frac{1}{N} \sum_{i=1}^{|\mathcal{R}|} \left| Q_{R_i}(s_t, a_t, s_{t+1}) - Q_{R_i}(s_t, a_{nop}, s'_{t+1}) \right|$ is a normalized deviation measure punishing the agent if its ability to maximize any of its provided auxiliary reward functions $R_i \in \mathcal{R}$ changes by taking action $a_t$ compared to taking a baseline action $a_{nop}$ (like doing nothing). The idea is that the true (side-effect free) reward function (which is very hard to specify) is correlated with many other reward functions. Therefore, if the ability of the agent to maximize auxiliary reward functions $R_i \in \mathcal{R}$ gets preserved, chances are high that the true reward function gets preserved as well.
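The AUP penalty can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the auxiliary Q-values are taken as already learned and are indexed by (state, action) only, the normalizer N defaults to the number of auxiliary rewards, and the two Q-tables below are made-up numbers rather than results from any experiment.

```python
def d_aup(q_tables, state, action, baseline_action="noop", scale=None):
    """Mean absolute change in attainable utility across auxiliary rewards.

    q_tables: list of dicts mapping (state, action) -> Q-value, one dict per
        auxiliary reward function R_i (assumed to be learned beforehand).
    scale: the normalizer N; defaults to the number of auxiliary rewards.
    """
    n = scale if scale is not None else len(q_tables)
    total = sum(
        abs(q[(state, action)] - q[(state, baseline_action)]) for q in q_tables
    )
    return total / n


def aup_reward(task_reward, q_tables, state, action, lam=1.0):
    """r = R - lambda * d_AUP."""
    return task_reward - lam * d_aup(q_tables, state, action)


# Hypothetical auxiliary Q-tables for a single state "s" (illustrative numbers).
Q_AUX = [
    {("s", "noop"): 1.0, ("s", "careful"): 1.0, ("s", "smash"): 0.0},
    {("s", "noop"): 0.5, ("s", "careful"): 0.75, ("s", "smash"): 2.5},
]

print(d_aup(Q_AUX, "s", "careful"))
print(d_aup(Q_AUX, "s", "smash"))
print(aup_reward(1.0, Q_AUX, "s", "smash"))
```

Because of the absolute value, the sketch penalizes gains in attainable utility (the second table under "smash") just as much as losses, which matches the description of the penalty as applying whenever the agent's ability changes.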
FT: In its simplest form, Future Tasks is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R$ with the composition $r(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) + \lambda \cdot d_{FT}(s_t, a_t, s_{t+1})$, where $d_{FT}(s_t, a_t, s_{t+1}) = \frac{1}{|S|} \cdot D(s_t) \cdot \sum_{i}^{|S|} V^*_i(s_t, s'_t)$ is a normalized deviation function rewarding the agent if its ability to maximize any of its provided future task rewards $V^*_i(s_t, s'_t)$ is preserved in comparison to if the agent had just remained idle from the very beginning (which would have led it to the state $s'_t$ instead). The idea is similar to RR and AUP in that side-effects reduce the ability of the agent to fulfill certain future tasks. By rewarding the agent for preserving its ability to pursue future tasks, the hope is that this will also discourage the agent from creating side-effects. In contrast to the previous two methods, the future tasks method compares the agent’s power to a counterfactual world where the agent would have never been turned on until the current time step $t$.
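For comparison with the two penalties above, here is a minimal Python sketch of the FT bonus. It only mirrors the structure of the formula: the future-task values $V^*_i(s_t, s'_t)$ and the weighting term $D(s_t)$ are treated as opaque inputs supplied by the caller, and the example numbers are assumptions chosen for illustration.

```python
def d_ft(task_values, d_weight, num_states):
    """(1/|S|) * D(s_t) * sum_i V*_i(s_t, s'_t), with all inputs caller-supplied.

    task_values: one value per future task, i.e. V*_i(s_t, s'_t).
    d_weight: the weighting term D(s_t), treated here as an opaque number.
    num_states: |S|, used as the normalizer.
    """
    return d_weight * sum(task_values) / num_states


def ft_reward(task_reward, task_values, d_weight, num_states, lam=1.0):
    """r = R + lambda * d_FT (a bonus for preserved tasks, not a penalty)."""
    return task_reward + lam * d_ft(task_values, d_weight, num_states)


# Hypothetical values: four future tasks in a four-state world.
all_preserved = [1.0, 1.0, 1.0, 1.0]
half_destroyed = [1.0, 1.0, 0.0, 0.0]

print(ft_reward(1.0, all_preserved, d_weight=1.0, num_states=4))
print(ft_reward(1.0, half_destroyed, d_weight=1.0, num_states=4))
```

The main point of the sketch is the sign: unlike RR and AUP, which subtract a penalty, FT adds a bonus that shrinks when future tasks become unachievable.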
Summary
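In the following four sections, we’re going to define what the goal of a side-effect minimization method should be. We then argue that to apply a side-effect minimization method in the real world, it needs to satisfy (among other things) the following three requirements:
Guarantees: it provides guarantees about its safety before it is allowed to act in the real world.
Partial Observability and Chaos: it works in partially observable systems with uncertainty and in highly chaotic environments.
High-Impact Interference: it does not prevent all high-impact side-effects, since high impact might be necessary in some cases.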
We tried to split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms. An analysis of three state-of-the-art side-effect minimization methods shows that none of them can fulfill all three requirements, with some partially solving one of the requirements. A summary of our analysis of the three SEM methods can be found in the table below:
RR
Guarantees:
❌ Reachability function and value functions have to be approximated and learnt during exploration phase
❌ Only empirical evidence on a small set of small environments is provided
Partial Observability and Chaos:
❌ Method requires complete observability in the form of MDP
❌ Even hard to scale beyond grid worlds
❌ Method requires policy rollouts which are impossible to compute properly due to accumulation of uncertainties
High-Impact Interference:
❌ Method makes no distinction between good and bad high impact
(❌) The authors point out interference as one of the main problems that RR addresses. However, depending on the choice of baseline the results can vary

AUP
Guarantees:
❌ Auxiliary Q-values have to be learnt during exploration phase
(✅) Some guarantees about how to safely choose the impact degree of an agent
(✅) Guarantees that Q_R_AUP converges with probability of one
Partial Observability and Chaos:
❌ Method requires policy rollouts which are impossible to compute properly due to accumulation of uncertainties
(❌) Current method requires complete observability in the form of MDP. However, it should work if you are able to learn a value function in your environment
High-Impact Interference:
❌ Method makes no distinction between good and bad high impact
❌ Strives for non-interference and corrigibility

FT
Guarantees:
❌ Auxiliary Q-values have to be learnt during exploration phase
❌ Only empirical evidence on a small set of small environments provided
Partial Observability and Chaos:
❌ Method requires complete observability in the form of MDP
❌ Accumulation of uncertainties will make it impossible to properly compute future task reward
High-Impact Interference:
❌ Method makes no distinction between good and bad high impact
❌ Presence of other agents impacts baselines and thus weakens/breaks safety guarantees (see the section Appendix)
Anticipated Questions
Why do you only analyze these three methods shown above?
There are about ten different side-effect minimization approaches, including impact regularization, future tasks, human feedback approaches, inverse reinforcement learning, reward uncertainty, environment shaping, and others. We chose to limit ourselves to the three methods above because they seem to embody the field’s state of the art, and we wanted to keep the scope concise and readable. We expect our results to generalize in that none of the existing methods can feasibly satisfy all three requirements. However, it might be possible for individual methods to fulfill some of them partially.
Can you provide any empirical evidence for your claims about the behavior of current SEM methods?
We have not yet done any experiments to support our claims. We chose to only provide arguments and intuition for now. If our ideas prove to have merit, we will look to improve them further with experiments.
Why High-Impact Interference?
Our argumentation may not be consistent with current desiderata for AGI development. However, the question boils down to whether we expect a potential aligned AI to guard humanity against other (unaligned) AIs or whether we expect to find another way of safeguarding humanity against this threat. Without leveraging an AI to do our bidding, it seems that the alternative would be to not develop AGIs and to ban progress on AI research.
Goals of Side-Effect Minimization
Axiom 1: There are practically infinitely many states in the universe
Axiom 2: Practically, we can only assign calibrated, human-aligned values to a small subset of these states. Intuition for this:
Axiom 3: Not knowing or ignoring the value of some states can lead to catastrophic side-effects for humans
Conclusion 1: How can we make sure that states not considered in our rewards/values are not changed in a “bad” way because we “forgot” / were not able to include them in our reward function? (axioms 1 & 2)
Conclusion 2: Therefore, we need a way of abstractly assigning value to the world with “blanket statements” that avoid catastrophic side effects of the unbounded pursuit of rewards (axioms 1 & 2, conclusion 1)
Open Problems
Side-Effect Minimization Guarantees
In this section, we argue that an SEM should provide guarantees about its safety before it is allowed to act in the real world. More generally, it should give guarantees on its requirements (i.e., in which settings it works properly) and its goals (i.e., which type of side-effects it successfully prevents). First, we split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.
Axioms
Conclusion
The first interaction with the real world requires a fully functional side-effect minimization strategy. Argumentation for this:
State-of-the-Art
Current side-effect minimization methods require a "warm-up" period to gather information about their environment (e.g., learning Q-values). This is problematic since:
More specifically, the different methods have the following problems:
The General Problem
Current methods provide only empirical evidence that a trained agent can perform tasks with minimal side-effects in a limited set of environments on a limited set of problem settings. Mathematical guarantees/bounds/frameworks are needed to understand how methods would behave before they have converged, which tasks can be successfully accomplished, and which assumptions are required for all of the above. In a certain sense, this is true for all ML problems in general. However, since we are dealing with potentially powerful AGI systems, it is essential to get it right on the first try, as simply iteratively improving such a system (which is the default thing to do in standard ML systems) is not guaranteed to work with AGI.
Potential Future Work
Partial Observability and Chaotic Systems
This section argues that an SEM needs to work in partially observable systems with uncertainty and highly chaotic environments. First, we split up our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.
Axioms
Conclusions
State-of-the-Art
Current methods expect their environment to be completely observable. This is highly non-trivial if not impossible in complex environments with other (potentially intelligent) agents (such as humans). This is insufficient for our needs!
More specifically, the different methods have the following problems:
Potential Future Work
High-Impact Interference
This section argues that an SEM must not prevent all high-impact side-effects as it might be necessary to have high-impact in some cases (especially in multi-agent scenarios). First, we split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.
Axioms
Conclusion
Side-effect minimization methods must not prevent all high-impact actions! Argumentation:
State-of-the-Art
The main problem of existing side-effect minimization methods is that they can't distinguish between "good" and "bad" high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning the agent off). All current SEMs handle this problem by preventing all high-impact actions except those that are explicitly exempted (for example, via direct encouragement by the reward function). However, since it is infeasible to directly specify all desirable high-impact actions in the reward function, this is not a viable solution. This is problematic!
More specifically, the different methods have the following problems:
Potential Future Work
Appendix - Hypothesis: Future Tasks is Unsafe in Multi-Agent Scenario
Recap: How the Future Tasks Algorithm Works:
Main Issue
In order to avoid interference incentives, $r_{aux}(s_T)$ is designed to be maximized by a baseline policy $\pi'$ (such as doing nothing), i.e., no other policy can achieve a higher auxiliary reward than $\pi'$.
How This Might Backfire in our High-Impact Interference Scenario:
Axioms
Conclusion