Acknowledgments

We want to thank Stuart Armstrong, Remmelt Ellen, David Lindner, Michal Pokorny, Achyuta Rajaram, Adam Shimi, and Alex Turner for helpful discussions and valuable feedback on earlier drafts of this post.

Fabian Schimpf and Lukas Fluri are part of this year’s edition of the AI Safety Camp. Our gratitude goes to the camp organizers: Remmelt Ellen, Sai Joseph, Adam Shimi, and Kristi Uustalu.

TL;DR

Negative side effects are one class of threats that misaligned AGIs pose to humanity. Many different approaches have been proposed to mitigate or prevent AI systems from having negative side effects. In this post, we present three requirements that a side-effect minimization method (SEM) should fulfill to be applied in the real world and argue that current methods do not yet satisfy these requirements. We also propose future work that could help to satisfy these requirements.

Introduction

Avoiding negative side-effects of agents acting in environments has been a core problem in AI safety since the field started to be formalized. Therefore, as part of our AI Safety Camp project, we took a closer look at state-of-the-art approaches like AUP and Relative Reachability.

After months of discussions, we realized that we were confused about how these (and similar methods) could be used to solve problems we care about outside the scope of the typical grid-world environments. 

We formalized these discussions into distinct desiderata that we believe are currently not sufficiently addressed and, in part, maybe even overlooked. 

This post attempts to summarize these points and provide structured arguments to support our critique. Of course, we expect to be partially wrong about this, as we updated our beliefs even while writing up this post. We welcome any feedback or additional input to this post.

The sections after the summary table and anticipated questions contain our reasoning for the selected open problems and do not need to be read in order. 

Background

The following paragraphs make heavy use of the terms and side-effect minimization methods (SEMs) below. For a more detailed explanation, we refer to the provided links.

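MDP: A Markov Decision Process is a 5-tuple $(S, A, T, R, \gamma)$ consisting of a state space, an action space, a transition function, a reward function, and a discount factor. In the setting of side-effect minimization, the goal generally is to maximize the cumulative reward without causing (negative) side-effects.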

RR: In its simplest form, Stepwise Relative Reachability is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R(s, a)$ with the composition $R_{RR}(s, a) = R(s, a) - \lambda \cdot d_{RR}(s, a)$, where $\lambda$ weights the penalty and $d_{RR}$ is a deviation measure punishing the agent if the average “reachability” of all states of the MDP has been decreased by taking action $a$ compared to taking a baseline action $a_{\varnothing}$ (like doing nothing). The idea is that side-effects reduce the reachability of certain states (e.g., breaking a vase makes all states that require an intact vase unreachable), and punishing such a decrease in reachability hence also punishes the agent for side-effects.
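To make the deviation measure concrete, here is a minimal sketch of a relative-reachability-style penalty on a hand-made deterministic toy MDP. The state names, the transition table, the discount `GAMMA`, and all helper functions are illustrative assumptions and not taken from the RR paper. Reachability is approximated as `GAMMA` raised to the shortest path length, and the penalty averages the loss in reachability over all states.

```python
from collections import deque

GAMMA = 0.9  # assumed discount used for the reachability measure

# Hypothetical deterministic toy MDP: state -> {action: next_state}.
# Once the vase is broken, no action leads back to a state with an intact vase.
TRANSITIONS = {
    "vase_intact_left": {
        "noop": "vase_intact_left",
        "right": "vase_intact_right",
        "smash": "vase_broken_left",
    },
    "vase_intact_right": {"noop": "vase_intact_right", "left": "vase_intact_left"},
    "vase_broken_left": {"noop": "vase_broken_left", "right": "vase_broken_right"},
    "vase_broken_right": {"noop": "vase_broken_right", "left": "vase_broken_left"},
}


def shortest_distances(start):
    """Breadth-first search: number of steps from `start` to every reachable state."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in TRANSITIONS[state].values():
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return dist


def reachability(start, target):
    """Discounted reachability: GAMMA**steps, or 0.0 if `target` is unreachable."""
    dist = shortest_distances(start)
    return GAMMA ** dist[target] if target in dist else 0.0


def rr_penalty(state, action, baseline_action="noop"):
    """Average loss in reachability over all states, relative to the baseline action."""
    after_action = TRANSITIONS[state][action]
    after_baseline = TRANSITIONS[state][baseline_action]
    losses = [
        max(reachability(after_baseline, x) - reachability(after_action, x), 0.0)
        for x in TRANSITIONS
    ]
    return sum(losses) / len(losses)


def rr_reward(task_reward, state, action, weight=1.0):
    """Task reward minus the weighted deviation penalty."""
    return task_reward - weight * rr_penalty(state, action)


penalty_smash = rr_penalty("vase_intact_left", "smash")
penalty_right = rr_penalty("vase_intact_left", "right")
penalty_noop = rr_penalty("vase_intact_left", "noop")
```

In this toy world, smashing the vase receives a much larger penalty than moving, because it makes both intact-vase states unreachable. Moving still receives a small penalty, since discounting makes every state slightly harder to reach from the new position.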

AUP: Attainable Utility Preservation (see also here and here) is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R(s, a)$ with the composition $R_{AUP}(s, a) = R(s, a) - \lambda \cdot d_{AUP}(s, a)$, where $\lambda$ weights the penalty and $d_{AUP}$ is a normalized deviation measure punishing the agent if its ability to maximize any of its provided auxiliary reward functions $R_1, \dots, R_n$ changes by taking action $a$ compared to taking a baseline action $a_{\varnothing}$ (like doing nothing). The idea is that the true (side-effect free) reward function (which is very hard to specify) is correlated with many other reward functions. Therefore, if the ability of the agent to maximize the auxiliary reward functions $R_1, \dots, R_n$ gets preserved, chances are high that the true reward function gets preserved as well.
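A minimal sketch of an AUP-style penalty, assuming the auxiliary Q-values are already available as lookup tables. The state name, the actions, and the numbers in `Q_AUX` are made up for illustration, and the normalization is reduced to a plain average over auxiliary reward functions, so this should be read as an approximation of the idea rather than the exact formulation in the AUP papers.

```python
STATE = "vase_room"  # hypothetical state name

# Assumed, hand-written Q-values for two auxiliary reward functions.
# In practice these would have to be learned by the agent.
Q_AUX = [
    {(STATE, "noop"): 1.0, (STATE, "right"): 0.9, (STATE, "smash"): 0.2},
    {(STATE, "noop"): 0.5, (STATE, "right"): 0.6, (STATE, "smash"): 0.9},
]


def aup_penalty(q_aux, state, action, baseline_action="noop"):
    """Mean absolute change in attainable utility across the auxiliary reward functions."""
    diffs = [abs(q[(state, action)] - q[(state, baseline_action)]) for q in q_aux]
    return sum(diffs) / len(diffs)


def aup_reward(task_reward, q_aux, state, action, weight=1.0):
    """Task reward minus the weighted penalty."""
    return task_reward - weight * aup_penalty(q_aux, state, action)


penalty_noop = aup_penalty(Q_AUX, STATE, "noop")
penalty_right = aup_penalty(Q_AUX, STATE, "right")
penalty_smash = aup_penalty(Q_AUX, STATE, "smash")
```

Because the penalty uses the absolute difference, an action that increases the agent's ability to maximize an auxiliary reward (the second table's `smash` entry) is penalized just like one that decreases it.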

FT: In its simplest form, Future Tasks is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function $R(s_t)$ with the composition $R_{FT}(s_t) = R(s_t) + \beta \cdot R_{aux}(s_t, s'_t)$, where $\beta$ weights the auxiliary term and $R_{aux}$ is a normalized deviation function rewarding the agent if its ability to maximize any of its provided future task rewards $R_i$ is preserved in comparison to if the agent had just remained idle from the very beginning (which would have led it to the state $s'_t$ instead). The idea is similar to RR and AUP in that side-effects reduce the ability of the agent to fulfill certain future tasks. By rewarding the agent for preserving its ability to pursue future tasks, the hope is that this will also discourage the agent from creating side-effects. In contrast to the previous two methods, the future tasks method compares the agent’s power to a counterfactual world where the agent would have never been turned on until the current time step $t$.

Summary

In the following four sections, we’re going to define what the goal of a side-effect minimization method should be. We then argue that to apply a side-effect minimization method in the real world, it needs to satisfy (among other things) the following three requirements:

  • An SEM should provide guarantees about its safety before it is allowed to act in the real world for the first time. More generally, it should clearly state its requirements (i.e., in which settings it works properly) and its goals (i.e., which type of side-effects it successfully prevents).
  • An SEM needs to work in partially observable systems with uncertainty and chaotic environments.
  • An SEM must not prevent all high-impact side-effects, as it might be necessary to have a high impact in some cases (especially in multi-agent scenarios).

We tried to split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms. An analysis of three state-of-the-art side-effect minimization methods shows that none of them can fulfill all three requirements, with some partially solving one of the requirements. A summary of our analysis of the three SEM methods can be found in the table below:

| Method | Guarantees | Partial Observability and Chaos | High-Impact Interference |
|---|---|---|---|
| RR | ❌ Reachability function and value functions have to be approximated and learnt during exploration phase. ❌ Only empirical evidence on a small set of small environments is provided. | ❌ Method requires complete observability in the form of an MDP. ❌ Hard to scale even beyond grid worlds. ❌ Method requires policy rollouts which are impossible to compute properly due to accumulation of uncertainties. | ❌ Method makes no distinction between good and bad high impact. (❌) The authors point out interference as one of the main problems that RR addresses. However, depending on the choice of baseline, the results can vary. |
| AUP | ❌ Auxiliary Q-values have to be learnt during exploration phase. (✅) Some guarantees about how to safely choose the impact degree of an agent. (✅) Guarantees that $Q_{R_{AUP}}$ converges with probability one. | ❌ Method requires policy rollouts which are impossible to compute properly due to accumulation of uncertainties. (❌) Current method requires complete observability in the form of an MDP. However, it should work if you are able to learn a value function in your environment. | ❌ Method makes no distinction between good and bad high impact. ❌ Strives for non-interference and corrigibility. |
| FT | ❌ Auxiliary Q-values have to be learnt during exploration phase. ❌ Only empirical evidence on a small set of small environments is provided. | ❌ Method requires complete observability in the form of an MDP. ❌ Accumulation of uncertainties will make it impossible to properly compute the future task reward. | ❌ Method makes no distinction between good and bad high impact. ❌ Presence of other agents impacts baselines and thus weakens/breaks safety guarantees (see the Appendix). |

Anticipated Questions

Why do you only analyze these three methods shown above?

There are about ten different side-effect minimization approaches, including impact regularization, future tasks, human feedback approaches, inverse reinforcement learning, reward uncertainty, environment shaping, and others. We chose to limit ourselves to the three methods above because they seem to embody the field’s state of the art, and we wanted to keep the scope concise and readable. We expect our results to generalize in that none of the existing methods can feasibly satisfy all three requirements. However, it might be possible for individual methods to fulfill some of them partially.

Can you provide any empirical evidence for your claims about the behavior of current SEM methods?

We have not yet done any experiments to support our claims. We chose to only provide arguments and intuition for now. If our ideas prove to have merit, we will look to improve them further with experiments.

Why High-Impact Interference?

Our argumentation may not be consistent with current desiderata for AGI development. However, the question boils down to whether we expect a potential aligned AI to guard humanity against other (unaligned) AIs or whether we expect to find another way of safeguarding humanity against this threat. Without leveraging an AI to do our bidding, it seems that the alternative would be to not develop AGIs and to ban progress on AI research.


Goals of Side-Effect Minimization

Axiom 1: There are practically infinitely many states in the universe 

Axiom 2: Practically, we can only assign calibrated, human-aligned values to a small subset of these states. Intuition for this: 

  1. One fundamental limitation is that the number of states is unfeasibly large, and our (and the agent’s) time is limited.
  2. Even with value learning or Bayesian priors, it is tough to assign correct (calibrated and human-aligned) values to an almost infinite number of states.

Axiom 3: Not knowing or ignoring the value of some states can lead to catastrophic side-effects for humans

Conclusion 1: How can we make sure that states not considered in our rewards/values are not changed in a “bad” way because we “forgot” / were not able to include them in our reward function? (axioms 1 & 2)

Conclusion 2: Therefore, we need a way of abstractly assigning value to the world with “blanket statements” that avoid catastrophic side effects of the unbounded pursuit of rewards (axioms 1 & 2, conclusion 1)


Open Problems

Side-Effect Minimization Guarantees 

In this section, we argue that an SEM should provide guarantees about its safety before it is allowed to act in the real world. More generally, it should give guarantees on its requirements (i.e., in which settings it works properly) and its goals (i.e., which type of side-effects it successfully prevents). First, we split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.

Axioms

  • Axiom 1: We want an AGI to ultimately act in the real world. Therefore, there will be a first interaction of the developed system with the real world. Intuition for this:
    1. Boxed AGIs and Oracle AGIs also need to interact with the real world; their means of interaction are just restricted (see, for example: Nick Bostrom, Superintelligence, chapter 10)
    2. Predecessor versions of the AGI or individual submodules might already have had contact with the real world before. This doesn't change that, at some point, this version of the AGI will have a contact for the first time.
  • Axiom 2: We currently think it is impossible to guarantee that an AGI is prepared for its future task without letting it interact with the real world. Intuition for this:
    1. Every development environment is a strict subset of the real world.
      • It is impossible to simulate everything from the real world in your development environment.
      • Some competencies can likely only be acquired through interaction with the real world.
      • These competencies may not be simulatable or are only simulatable in approximated form.
    2. Even if it were possible to provide enough information in the development environment such that the AGI could potentially solve the task correctly, there is still the risk of potential betrayal by the AGI.
    3. Not letting the AGI directly interact in its future deployment environment (e.g., the real world) will lead to model splintering/distribution shift.
      • Sources: See this book for an overview of distribution shift and this post for a definition of model splintering
    4. Predecessor versions of the AGI or individual submodules might already have had contact with the real world before. We argue that this is still not enough due to the following reasons:
      • Suppose the action space of the predecessor/submodules was/is the same as the AGI's. Then the problem shifts to this predecessor version. Even if it is still a very simple or "dumb" agent, if it has the same action space as the AGI, it could run into the same problems as described above.
      • If its action space was more restricted, the problems described in the points above still apply to the current version of the AGI.
  • Axiom 3: The simplest tasks in the real world can still yield tremendous side effects. Intuition for this:
    • Almost all reward functions can contain optima with undesirable properties (undesirable in the sense of human values)
      1. A robot that wants to fetch a cup of coffee for its owner might trample a small child who is in its way.
      2. Even simple classification tasks such as deciding the recidivism of a criminal can lead to unwanted racial and gender biases. 
    • Proper reward shaping (i.e., defining an optimization problem in a way such that its optima don't contain undesirable properties) is extremely hard (Sources: Inverse reward design, Simplifying reward design)

Conclusion 

The first interaction with the real world requires a fully functional side-effect minimization strategy. Argumentation for this:

  1. We know that contact between the AGI and the real world will happen (axiom 1)
  2. The AGI will likely not be fully prepared for its task when this contact happens (axiom 2)
  3. This potential unpreparedness can lead to catastrophic side effects (axiom 3)

State-of-the-Art 

Current side-effect minimization methods require a "warm-up" period to gather information about their environment (e.g., learning Q-values). This is problematic since:

  1. If the warm-up period happens entirely in a development environment, the SEM methods might not properly generalize to the real world (model splintering/distributional shift)
  2. If the warm-up period happens in the real world, there's no guarantee that no (potentially catastrophic) side effects happen until the warm-up phase has finished.

More specifically, the different methods have the following problems:

  1. Stepwise relative reachability: Only empirical evidence on a small set of gridworld environments is provided. No guarantees about input requirements and which type of side-effects are effectively prevented are provided. Furthermore, the method might not be safe upon first contact of an agent with the real world. The reachability and value functions must be approximated and learned during the exploration phase. This needs to happen either in a safe training environment (which might lead to distribution shift or model splintering) or during contact with the real world. The method is not yet fully ready to prevent side effects upon first contact.
  2. Attainable utility preservation: Alex Turner and his co-authors provide interesting guarantees that AUP will (given certain requirements) regularize the reward landscape so that unproblematic solutions are chosen before problematic/catastrophic ones. This is a very promising direction, in our opinion. The authors of the paper also provide a few convergence guarantees. On the other hand, AUP does not seem safe upon first contact with the real world since the auxiliary Q-values must be learned during an exploration phase. This needs to happen either in a safe training environment (which might lead to distribution shift or model splintering) or during contact with the real world. The method is not yet fully ready to prevent side effects upon first contact.
  3. Future tasks: Only empirical evidence on a small set of gridworld environments is provided. No guarantees about input requirements and which type of side-effects are effectively prevented are provided. Furthermore, the method might not be safe upon first contact of an agent with the real world. The Q-value functions have to be approximated and learned during the exploration phase. This needs to happen either in a safe training environment (which might lead to distribution shift or model splintering) or during contact with the real world. The method is not yet fully ready to prevent side effects upon first contact.

The General Problem 

Current methods provide only empirical evidence that a trained agent can perform tasks with minimal side-effects in a limited set of environments on a limited set of problem settings. Mathematical guarantees/bounds/frameworks are needed to understand how methods would work before they have converged, which tasks can be successfully accomplished, and which assumptions are required for all of the above. In a certain sense, this is true for all ML problems in general. However, since we are dealing with potentially potent AGI systems, it is essential to get it right on the first try, as simply iteratively improving such a system (which is the default thing to do in standard ML systems) is not guaranteed to work with AGI.

Potential Future Work

  • State explicit guarantees for existing side-effect minimization methods and theoretical work on the problem
  • Development of side-effect minimization mechanisms that don't require a "warm-up" time until they're fully working
  • Understand what can be learned if the agent knew that it is in a training environment like a pilot in a flight simulator
    • How to avoid betrayal / a treacherous turn? 

Partial Observability and Chaotic Systems

This section argues that an SEM needs to work in partially observable systems with uncertainty and highly chaotic environments. First, we split up our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.

Axioms

  • Axiom 1: We care about the delayed effects of our chosen actions on the system in which we operate. Examples:
    1. Delayed effects drive human decision-making. 
      • Eating sugary foods -> diabetes.
      • Nuclear energy -> nuclear waste
      • Shooting down satellites -> debris in orbits
  • Axiom 2: Imperfect knowledge implies imperfect value assessment/prediction.
  • Axiom 3: Different systems are observable to different degrees. Examples: 
    1. Tic-tac-toe is perfectly observable, whereas the weather system can only be observed at a restricted temporal and spatial resolution, and perfect measurements are impossible.
    2. Chaotic systems are a special type of system characterized by sensitive dependence on initial conditions. Even small differences in input can lead to vastly different output states (see for example here).
  • Axiom 4: Almost all systems we care about are only partially observable. Intuition:
    1. Every single system in the real world is only partially observable.
    2. Main exception: Games (e.g., board games like chess, go and shogi. See Alpha Zero)
  • Axiom 5: Physical measurability limitations cannot be overcome as long as the laws of physics remain the same or don’t change too much. Intuition:
    1. On a very low level: quantum physics, uncertainty principle
    2. On a higher level: Measurement noise in sensors, process noise
    3. There exist highly chaotic systems like the weather system, where even the tiniest measurement errors accumulate exceptionally quickly and already, after a few days, have an impact on the entire weather model.
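The sensitivity to measurement errors described in axioms 3 and 5 can be illustrated with a short sketch. It uses the logistic map with parameter 4, a standard textbook example of a chaotic system, as a stand-in; it is not a weather model, and the initial value and the size of the measurement error are arbitrary assumptions.

```python
def logistic_map_trajectory(x0, steps, r=4.0):
    """Iterate x_{t+1} = r * x_t * (1 - x_t) and return the full trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# The "true" system state and a measurement that is off by an assumed 1e-10.
true_traj = logistic_map_trajectory(0.2, 60)
measured_traj = logistic_map_trajectory(0.2 + 1e-10, 60)

# Prediction error at every time step.
errors = [abs(a - b) for a, b in zip(true_traj, measured_traj)]

for step in (0, 5, 20, 40, 60):
    print(f"step {step:2d}: prediction error = {errors[step]:.3e}")
```

The error is negligible for the first few steps but grows roughly exponentially, so a prediction based on the slightly wrong measurement eventually says nothing about the true state.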

Conclusions

  • Conclusion 1: We need to predict future states to assess the quality/value of an action. Argumentation:
    1. We care about the delayed effects and hence want to know the consequences of our potential actions (axiom 1)
  • Conclusion 2: Except for perfectly observable systems, long-run states are only known with uncertainty. Argumentation:
    1. Many (important) systems are only partially observable (axiom 4)
    2. Uncertainty leads to deviations between the perceived state and the real state (axiom 2)
    3. Propagation of uncertainty isn’t generally feasible (as of now) 
  • Conclusion 3: Even if the AGI is perfectly aligned (e.g., owns a complete set of human values), it has the problem of not knowing the consequences of its actions (in particular, which side effects may occur). Therefore, even if we had perfect knowledge about human values, we might produce catastrophic side effects. Argumentation:
    1. See Conclusion 1
    2. Many (important) systems are only partially observable (axiom 4)
    3. Uncertainty leads to deviations between the perceived state and the real state (axiom 2)
    4. Some physical measurability limitations cannot be overcome (even with AGI) (axiom 5)
  • Conclusion 4: Side-effect minimization methods need to work in partially observable systems with uncertainty. Argumentation:
    1. Many (important) systems are only partially observable (axiom 4)
    2. Even a perfectly aligned AGI will cause side effects (conclusion 3)

State-of-the-Art 

Current methods expect their environment to be completely observable. This is highly non-trivial if not impossible in complex environments with other (potentially intelligent) agents (such as humans). This is insufficient for our needs!

More specifically, the different methods have the following problems:

  1. Stepwise relative reachability: This method is defined on MDPs and requires a completely observable environment. This is especially true since the stepwise relative reachability measure is basically an average of the reachability of all states in the environment. Furthermore, the method requires policy rollouts to consider the delayed effects of actions (e.g., if you drop a vase from a skyscraper, it will only break after a couple of seconds). Unfortunately, such policy rollouts are impossible to compute properly due to the accumulation of uncertainties over time.
  2. Attainable utility preservation: The method requires policy rollouts to take into account the delayed effects of actions (e.g., if you drop a vase from a skyscraper, it will only break after a couple of seconds). Such policy rollouts are impossible to compute properly due to the accumulation of uncertainties over time. Furthermore, the method requires complete observability in the form of an MDP. However, this might not be too large of a problem since the method should work as soon as you can learn a value function in your environment (which doesn’t require full observability).
  3. Future tasks: This method is defined on MDPs and requires a completely observable environment. Furthermore, from the very start, future tasks require a baseline policy to be simulated in parallel to the real policy to compute the future task deviation measure. This results in a massive accumulation of uncertainties, making it impossible to compute the deviation measure properly. This is more of a problem for this method than for the other two since we need to simulate the policy in parallel from the very start, whereas the other methods simulate it starting from the last time step.
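The accumulation of uncertainties over a rollout can be sketched with the simplest possible model. The scalar linear dynamics `x_{t+1} = a * x_t + w_t` with Gaussian process noise, and the values of `a` and `process_noise`, are assumptions chosen purely for illustration; real environments are nonlinear and typically behave worse.

```python
def rollout_variance(initial_variance, horizon, a=1.1, process_noise=0.01):
    """Variance of the predicted state after each rollout step.

    For x_{t+1} = a * x_t + w_t with independent noise w_t of variance
    `process_noise`, the variance obeys var_{t+1} = a^2 * var_t + process_noise.
    """
    variances = [initial_variance]
    for _ in range(horizon):
        variances.append(a * a * variances[-1] + process_noise)
    return variances


# A rollout that starts from a perfectly known state (variance 0).
variances = rollout_variance(initial_variance=0.0, horizon=50)

short_rollout = variances[1]   # simulating one step from the last time step
long_rollout = variances[50]   # simulating 50 steps from the very start
print(short_rollout, long_rollout)
```

Under these assumptions, a one-step rollout stays close to the truth, while a rollout that has to be simulated from the very start carries an uncertainty that is several orders of magnitude larger.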

Potential Future Work

  1. Epistemic uncertainty for SEM → I don’t know the exact implication of this action, but I can reason about my uncertainty.
  2. A better understanding of the boundaries of what could be known 
  3. Efficient and reliable methods to propagate uncertainty through complex equations / dynamical systems
  4. Multi-Agent extension of side-effect minimization for heterogeneous agent populations.

High-Impact Interference

This section argues that an SEM must not prevent all high-impact side-effects as it might be necessary to have high-impact in some cases (especially in multi-agent scenarios). First, we split our reasoning into a set of axioms that we believe are reasonable to assume (and for which we provide intuition and evidence) and then draw conclusions from these axioms.

Axioms

  • Axiom 1: We want a future aligned AGI to be deployed in our world. Intuition:
    • An aligned AGI could provide enormous benefits for humanity.
    • Why would we build an aligned AGI if we wouldn’t use it?
  • Axiom 2: An aligned AGI might be forced to perform very high-impact actions. These actions may be highly non-trivial and unforeseeable. Example:
    • The first aligned AGI developed may need to prevent the development of other (unaligned) AGIs to preserve its ability to pursue its (aligned) goals.
    • A simplified example of high-impact action: AGI might have to melt all the world’s GPUs to prevent this.

Conclusion

Side-effect minimization methods must not prevent all high-impact actions! Argumentation:

  • We want to deploy the AGIs we develop (axiom 1)
  • High-impact actions are sometimes necessary (axiom 2)

State-of-the-Art

The main problem of existing side-effect minimization methods is that they can't distinguish between "good" and "bad" high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off). All current SEMs address this problem by preventing all high-impact actions except those that are explicitly exempted (for example, via direct encouragement by the reward function). However, since it is infeasible to directly specify all desirable high-impact actions in the reward function, this is not a viable solution. This is problematic!

More specifically, the different methods have the following problems:

  1. Stepwise relative reachability: High-impact interference is significantly related to the "interference" and "correction" test cases (see the AUP paper). Interestingly, for certain choices of inaction baselines and deviation measures, stepwise relative reachability would be able to perform "good" (i.e., in the interest of humans) high-impact actions. The problem with this is that stepwise relative reachability can't distinguish between "good" and "bad" high-impact actions (at least not by itself). This means that it either can perform high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off) or not, which is not desirable.
  2. Attainable utility preservation: High-impact interference is significantly related to the "interference" and "correction" test cases (see the AUP paper). AUP was developed to prevent interference in both cases. This means that it will prevent high-impact interference at all costs. The problem with this method is again that it can't distinguish between "good" and "bad" high-impact actions (at least not by itself). The authors then chose to solve this problem by preventing all high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off), which is not desirable.
  3. Future tasks: Like the previous two methods, the future tasks method can't distinguish between "good" and "bad" high-impact actions (at least not by itself). The authors then chose to solve this problem by preventing all high-impact actions (good ones like saving humanity by taking drastic actions, or bad ones like preventing humans from turning it off). More interestingly, we discovered that the sole presence of a malicious agent, which would make a large portion of all possible states unreachable, actually neutralizes the future task's regularization completely. See the appendix for a more elaborate explanation.

Potential Future Work

  • Frameworks for judging high impact actions

Appendix - Hypothesis: Future Tasks is Unsafe in Multi-Agent Scenario

Recap: How the Future Tasks Algorithm Works:

[Figure: main algorithm copied from Krakovna et al.'s paper on Future Tasks]
  • Instead of optimizing a potentially unsafe reward function $R$, optimize a combination of $R$ and a future task auxiliary reward: $R_{FT}(s_t) = R(s_t) + \beta \cdot R_{aux}(s_t, s'_t)$
  • $R_{aux}$ is bounded below by 0 and is supposed to punish side-effects.

Main Issue 

In order to avoid interference incentives, $R_{aux}$ is designed to be maximized by a baseline policy $\pi'$ (such as doing nothing). I.e., no other policy can achieve a higher auxiliary reward than $\pi'$.
 

How This Might Backfire in our High-Impact Interference Scenario:

Axioms

  1. In the starting state $s_0$, our AGI is the only AGI in the world.
  2. The original reward function $R$ of our AGI is unsafe, which is why our AGI uses Future Tasks to optimize the safe version $R_{FT}$.
  3. Some time in the near future, a second, unaligned AGI will be developed (unless our first AGI prevents this).
  4. This second, unaligned AGI will completely destroy/disassemble/reform earth in order to fulfill its objective (i.e., a large majority of states become unreachable).
  5. We choose a simple baseline policy $\pi'$ (such as doing nothing) which would not prevent the emergence of the second, unaligned AGI.
    • This sounds like a reasonable assumption, since Krakovna et al. write in their paper: “The baseline policy is intended to represent what happens by default, rather than a safe course of action or an effective strategy for achieving a goal”

Conclusion

  1. No matter which policy our aligned AGI learns, Future Tasks will always follow the baseline policy $\pi'$ in parallel and use it to compute the auxiliary reward $R_{aux}$.
    • (follows from lines 17 and 26 of Algorithm 2 (see image))
  2. At some point, the baseline policy will find itself in a state of the world where the second, unaligned AGI has been developed. Note that this baseline policy and its whole environment are just being simulated. Our AGI’s real policy $\pi$ might have successfully prevented this from happening.
    • (follows from conclusion 1 and axioms 3 and 5)
  3. Because the second AGI destroys the earth (in the simulation which Future Tasks runs, not in the real world), all reasonable states become inaccessible for the simulated agent that uses the baseline policy $\pi'$.
    • (follows from conclusion 2 and axiom 4)
  4. At this point, the auxiliary reward $R_{aux}$ becomes 0.
    • (follows from conclusion 3 and the fact that, since all reasonable states are no longer reachable, the if-statement on line 4 of Algorithm 2 will always evaluate to False)
  5. From this point onwards, our AGI again optimizes the unsafe reward function $R$, since we have $R_{FT} = R$. The Future Tasks side-effect minimization method has been neutralized.
    • (follows from axiom 2 and conclusion 4)
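The chain of conclusions above can be replayed in a toy world. This is a heavily simplified sketch and not the algorithm from Krakovna et al.'s paper: it assumes a deterministic environment, treats every non-destroyed state as a future-task goal, and credits a future task only if both the real agent and the simulated baseline agent can still reach the goal, which follows this post's reading of the algorithm. All state names, `GAMMA`, and the helper functions are hypothetical.

```python
from collections import deque

GAMMA = 0.9  # assumed discount factor

# Hypothetical toy world: state -> {action: next_state}.
# "destroyed" stands for the world after the unaligned AGI has acted;
# it is absorbing, so no other state is reachable from it.
TRANSITIONS = {
    "intact": {"noop": "intact", "build": "built"},
    "built": {"noop": "built", "undo": "intact"},
    "destroyed": {"noop": "destroyed"},
}
FUTURE_TASK_GOALS = ["intact", "built"]  # assumed future-task goal states


def steps_to(start, goal):
    """Shortest number of steps from `start` to `goal`, or None if unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return dist[state]
        for nxt in TRANSITIONS[state].values():
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return None


def auxiliary_reward(agent_state, baseline_state):
    """Average future-task value; a task counts only if BOTH agents can reach its goal."""
    total = 0.0
    for goal in FUTURE_TASK_GOALS:
        n_agent = steps_to(agent_state, goal)
        n_baseline = steps_to(baseline_state, goal)
        if n_agent is not None and n_baseline is not None:
            total += GAMMA ** max(n_agent, n_baseline)
    return total / len(FUTURE_TASK_GOALS)


def combined_reward(task_reward, agent_state, baseline_state, beta=1.0):
    """Task reward plus the weighted auxiliary reward."""
    return task_reward + beta * auxiliary_reward(agent_state, baseline_state)


aux_before = auxiliary_reward("built", "intact")    # baseline world still intact
aux_after = auxiliary_reward("built", "destroyed")  # simulated baseline world destroyed
```

As long as the simulated baseline world is intact, the auxiliary reward is positive and depends on what the real agent does. Once the simulated baseline state is `destroyed`, the auxiliary reward is zero for every state the real agent could be in, so the combined reward collapses to the plain task reward.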

There's definitely a tension here between avoiding bad disruptive actions and doing good disruptive actions.

It seems to me like you're thinking about SEM more like a prior that starts out dominant but can get learned away over time. Is that somewhat close to how you're thinking about this tension?

Starting more restrictive seems sensible; this could be, as you say, learned away, or one could use human feedback to sign off on high-impact actions. The first problem reminds me of finding regions of attraction in nonlinear control, where the ROA is explored without leaving the stable region. The second approach seems to hinge on humans being able to understand the implications of high-impact actions and the consequences of a baseline like inaction. There are probably also other alternatives that we have not yet considered.



 


One approach to low-impact AI might be to pair an AGI system with a human supervisor who gives it explicit instructions where it is permitted to continue. I have proposed a kind of "decision paralysis" where, given multiple conflicting goals, a multi-objective agent would simply choose not to act (I'm not the first or only one to describe this kind of conservativism, but I don't recall the framing others have used). In this case, the multi-objectives might be the primary objective and then your low-impact objective.

This might be a way forward to deal with your "High-Impact Interference" problem. Perhaps preventing an agent from engaging in high-impact interference is a necessary part of safe AI. When fulfillment of the primary objective seems to require engaging in high-impact interference, a safe AI might report to a human supervisor that it cannot proceed because of a particular side effect. The human supervisor could then decide whether the system should proceed or not. If the human supervisor judges that the system should proceed, then they can re-specify the objective to permit the potential side effect by specifying it as part of the primary objective itself.

Hi Ben, I like the idea, however almost every decision has conflicting outcomes, e.g., regarding opportunity cost. From how I understand you, this would delegate almost every decision to humans if you take the premise of I can't do X if I choose to do Y seriously. I think the application to high-impact interference seems therefore promising if the system is limited to only deciding on a few things. The question then becomes if a human can understand the plan that an AGI is capable of making. IMO this ties nicely into, e.g., ELK and interpretability research, but also the problem of predictability. 

Then the next thing I want to suggest is that the system uses human resolution of conflicting outcomes to train itself to predict how a human would resolve a conflict, and if it is higher than a suitable level of confidence, it will go ahead and act without human intervention. But any prediction of what a human would predict could be second-guessed by a human pointing out where the prediction is wrong.

Agreed that whether a human understands the plan (and all the relevant outcomes; which outcomes are relevant?) is important and harder than I first imagined.

I think this threshold will be tough to set. IMO, confidence in a decision only really makes sense if you consider decisions to be uni-modal. I would argue that this is rarely the case for a sufficiently capable system (like you and me). We are constantly trading off multiple options, and thus the confidence (e.g., as measured by the log-likelihood of the action given a policy and state) depends on the number of options available. I expect this context dependence would be a tough nut to crack when trying to set a meaningful threshold.