I read Andrew Critch and David Krueger's AI Research Considerations for Human Existential Safety (ARCHES) recently, which is a report that outlines and discusses 29 research directions which might be helpful for AI existential safety. This has also been discussed in Alignment Newsletter #103 and in an interview with Andrew Critch on the Future of Life Podcast.
This report focuses on "prepotent" AI, which is defined as follows: an AI system is prepotent if its development would transform the state of humanity's habitat - currently the Earth - in a manner that is at least as impactful as humanity and is unstoppable to humanity. Such an AI system would be as transformative to Earth as the entire effect of humanity so far, including the agricultural and industrial revolutions, and would be unstoppable in the sense that no group of humans could stop or reverse its impact even if they wanted to. In short, prepotent AI is Transformative AI (a term used by Open Philanthropy) that is also unstoppable to humans.
As well as being prepotent, an AI technology would need to be misaligned for it to be harmful, and hence this report talks a lot about misaligned prepotent AI (MPAI). When there are multiple stakeholders it is hard to define what alignment means, so the report uses human extinction as a clear (but perhaps overly restrictive) boundary for alignment: a prepotent AI system is misaligned if it is unsurvivable to humanity. This definition obviously misses some very bad outcomes, but it is justified by what is called the Human Fragility Argument, which says that most potential future states of the Earth are unsurvivable to humanity (for example, if all the oxygen in the atmosphere were removed). Hence, unless we have properly aligned a prepotent AI system, its deployment and subsequent transformative actions will likely result in human extinction.
The report is open about the things which are omitted, which include: bad outcomes which don't result in human extinction; the definition of a 'human'; the definition of 'beneficial'; and the impact of AI making other existential risks (nuclear weapons, bioweapons, etc.) more likely.
This report outlines 29 research directions with the goal of reducing the chance that MPAI will be deployed. It frames the human/AI relationship as one where humans delegate a task to an AI system, which avoids having to talk about AIs having perspectives or desires of their own, separate from human desires. The report divides the research directions based on the number of human stakeholders and AI systems involved. These are labeled with the number of humans first, then the number of AI systems (because humans come first, both historically and morally), i.e.:
Single/single: Single human stakeholder, single AI system
Single/multi: Single human stakeholder, multiple AI systems
Multi/single: Multiple human stakeholders, single AI system
Multi/multi: Multiple human stakeholders, multiple AI systems
From the single/single case, multiple stakeholders can quickly arise from new people trying to control part of the system or from disagreements among existing stakeholders. There will also be strong incentives to copy a single AI system, so a multi/multi situation may quickly arise.
Each of these categories is then divided according to three human capabilities needed for successful human/AI delegation:
Comprehension: the human ability to understand how an AI system works and what it will do
Instruction: the human ability to give instructions to an AI system
Control: the human ability to retain or regain control of a situation involving an AI
There are two tiers of risks which this report considers: Tier 1 risks lead directly to MPAI deployment events, while Tier 2 risks are conditions which increase the chances of Tier 1 risks (these could also be called risk factors).
The Tier 1 risks are intended to be an exhaustive list of the ways MPAI could come to be deployed:
Uncoordinated MPAI development: No single development team is primarily responsible for the development of MPAI, but it is developed and deployed partially due to coordination failures
Unrecognized prepotence: Prior to deployment, the developers didn't realize that the technology was prepotent or could become prepotent.
Unrecognized misalignment: Prior to deployment, the developers didn't realize that the technology was misaligned or could become misaligned.
Involuntary MPAI deployment: MPAI is deployed without the developers' voluntary permission, for example through accidental release or release by hackers.
Voluntary MPAI deployment: MPAI is voluntarily deployed, for example due to indifference to human extinction or malice.
The Tier 2 risks are a non-exhaustive list of risk factors:
Unsafe development races, where the pressure to develop powerful AI causes teams to neglect safety considerations
Economic displacement of humans, where AI systems have overwhelming economic power and humans have no power to bid for human values
Human enfeeblement, where humans become physically or mentally weaker due to the actions of AI systems
AI Existential Safety discourse impairment, where it becomes harder to have good discussion about AI risks and safety
This report repeatedly stresses the importance of considering whether research could be harmful. It seems likely that if some single/single problems are solved, this could speed up the deployment of powerful AI systems, which could lead to a multi/multi scenario that has been poorly prepared for. Additionally, many of the research directions are potentially 'dual use', where the research could directly increase the chances of MPAI deployment. For example, work on allowing an AI system to model human decision-making processes could help it work better with humans to achieve the humans' preferences, but it could also allow the AI system to manipulate humans more effectively in undesirable ways.
The report then outlines the 29 research directions in Chapters 5, 6, 8, and 9. Chapter 7 comes after the single-stakeholder research directions and outlines considerations relevant to AI systems controlled by multiple stakeholders. Each research direction section uses a set of useful subheadings which are often (but not always) repeated:
Social analogue: analogies to familiar human situations for understanding the scenario; for example, rather than talking about a human delegating to an AI, the report may talk about a boss delegating to an employee. These seemed really helpful for understanding the gist of what was going on.
Scenario-driven motivation: these are more direct explanations of how the research direction could help reduce existential risk.
Instrumental motivation: How this research direction could help with other research directions, rather than directly reduce existential risk.
Actionability: Discussion of existing work relevant to the research direction
Consideration of side effects: Ways in which the research direction could be harmful, such as the research being 'dual use'.
Historical note: Examples from history relevant to the research direction.
Here I will attempt to give a short description of the research directions. My aim here is to briefly say what the research directions are, not to give a detailed overview; reading the relevant section of the report is probably best for that.
Single/Single Delegation research
A single human stakeholder delegating to a single AI system.
Single/Single Comprehension
Direction 1: Transparency and explainability
Developing methods for looking at the inner workings of an AI system (transparency) and for explaining why it makes decisions in a way which is legible to humans (explainability).
Direction 2: Calibrated confidence reports
Developing AI systems which express accurate probabilistic confidence levels when answering questions or choosing actions. For example, the system should be properly calibrated, such that answers assigned 90% probability of being correct actually are correct 90% of the time. A well-calibrated system should have a good idea of when to shut itself off or trigger human intervention.
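To make "calibration" concrete, here is a minimal sketch (my own, not from the report) of one common way to measure it: bucket the system's stated confidences and compare each bucket's average confidence with its observed accuracy (an expected calibration error).

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Within each confidence bin, compare average stated confidence
    with observed accuracy, and average the gaps weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# A well calibrated system's answers tagged ~0.9 should be right ~90% of the time.
print(expected_calibration_error([0.9, 0.9, 0.6, 0.95], [1, 1, 0, 1]))
```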
Direction 3: Formal verification of ML systems
Using formal proof/argument verification methods for ML systems. Ideally it would be good to have a formal proof that a powerful AI system was not misaligned before we deploy it.
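To give a flavour of what verification for ML can look like today, here is a minimal sketch of interval bound propagation, one existing technique for proving simple input-output properties of small ReLU networks. The toy network and property below are invented for illustration and are far simpler than the kind of alignment properties ARCHES ultimately cares about.

```python
import numpy as np

def interval_forward(W_list, b_list, x_lo, x_hi):
    """Propagate an input box through a ReLU network with interval arithmetic,
    giving sound (if loose) bounds on every output over the whole box."""
    lo, hi = np.asarray(x_lo, float), np.asarray(x_hi, float)
    for i, (W, b) in enumerate(zip(W_list, b_list)):
        W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
        lo, hi = W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b
        if i < len(W_list) - 1:  # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

# Verify "the output stays below 1.0 for every input in the box" for a toy 2-2-1 net.
W = [np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([[0.3, -0.2]])]
b = [np.zeros(2), np.zeros(1)]
lo, hi = interval_forward(W, b, [-0.1, -0.1], [0.1, 0.1])
print("property holds over the whole input box:", bool(hi[0] < 1.0))
```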
Direction 4: AI-assisted deliberation (AIAD)
Using an AI system to assist humans to reflect on information and arrive at decisions which they reflectively endorse.
Direction 5: Predictive models of bounded rationality
Like humans, AI systems have limited computation, and so it would be good to have a model of what kinds of decisions are easy enough or too hard for an AI system with a given amount of computation.
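As a deliberately crude illustration of the kind of predictive model this direction asks for, the toy check below estimates whether an agent with a fixed node budget can exhaustively search a decision problem or must fall back on heuristics; the branching-factor/depth framing is my own simplification, not the report's.

```python
# Toy bounded-rationality model: is exhaustive search of a decision tree
# feasible within a given computation budget, or is the decision "too hard"?
def exhaustive_search_feasible(branching_factor: int, depth: int, node_budget: int) -> bool:
    nodes = sum(branching_factor ** d for d in range(depth + 1))
    return nodes <= node_budget

print(exhaustive_search_feasible(branching_factor=3, depth=8, node_budget=10**5))   # True
print(exhaustive_search_feasible(branching_factor=10, depth=8, node_budget=10**5))  # False
```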
Single/Single Instruction
Direction 6: Preference learning
Ensuring an AI system can learn how to act in accordance with the preferences of another system, such as a human stakeholder. This is mainly useful for ensuring that a system is aligned, and therefore reduces risks due to unrecognized misalignment.
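A minimal sketch of one standard approach (my choice, not specifically the report's): fit a utility function to pairwise preference comparisons using a Bradley-Terry model, the same idea behind reward-model training from human feedback. The simulated preferences and optimizer settings below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden "true" utilities that the simulated human's choices reflect.
true_utility = np.array([0.0, 1.0, 2.5, -0.5])
n_items = len(true_utility)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Simulated pairwise data: option a is preferred to b with probability
# sigmoid(u[a] - u[b])  (the Bradley-Terry choice model).
pairs = [(a, b) for a in range(n_items) for b in range(n_items) if a != b]
data = [(a, b) if rng.random() < sigmoid(true_utility[a] - true_utility[b]) else (b, a)
        for a, b in pairs for _ in range(50)]

# Fit utilities by gradient ascent on the log-likelihood of the observed choices.
u = np.zeros(n_items)
for _ in range(2000):
    grad = np.zeros(n_items)
    for winner, loser in data:
        p = sigmoid(u[winner] - u[loser])
        grad[winner] += 1 - p
        grad[loser] -= 1 - p
    u += 0.05 * grad / len(data)

print("learned ranking:", np.argsort(-u))  # should typically match np.argsort(-true_utility)
```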
Direction 7: Human belief inference
An AI system may interact better with humans if it can accurately infer the humans' beliefs. For example, this can include inferring what information a human does or does not know, and determining human beliefs from human actions.
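A tiny sketch of the basic idea: inferring what a human believes from an observed action using Bayes' rule. The shortcut scenario and all the probabilities below are invented for illustration.

```python
# The human either believes a shortcut road is open or believes it is closed.
# Someone who believes it is open takes it 90% of the time; someone who
# believes it is closed takes it only 5% of the time (occasional mistakes).
p_believes_open = 0.7          # prior over the human's belief
p_take_given_open = 0.9
p_take_given_closed = 0.05

def posterior_believes_open(took_shortcut: bool) -> float:
    like_open = p_take_given_open if took_shortcut else 1 - p_take_given_open
    like_closed = p_take_given_closed if took_shortcut else 1 - p_take_given_closed
    joint_open = p_believes_open * like_open
    joint_closed = (1 - p_believes_open) * like_closed
    return joint_open / (joint_open + joint_closed)

print(posterior_believes_open(True))   # ~0.98: taking the shortcut suggests they believe it is open
print(posterior_believes_open(False))  # ~0.20: avoiding it suggests they believe it is closed
```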
Direction 8: Human cognitive models
Using a mathematical or computational model of human cognition to allow an AI system to better interact with humans.
Single/Single Control
Direction 9: Generalizable shutdown and handoff methods
Designing methods such that an AI system can safely shut down or hand tasks back to a human. For example, the autopilot of an airplane should not be turned off until control has been safely handed back to the pilot. A safe shutdown could be operationalized as "entering a state from which a human controller can proceed safely".
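Here is a toy state-machine sketch of the handoff idea (the states, methods, and airplane framing are my own illustration, not the report's): the system refuses to complete a shutdown until the human has explicitly confirmed they have taken control, i.e. until it is in a state from which the human controller can proceed safely.

```python
from enum import Enum, auto

class Mode(Enum):
    AI_CONTROL = auto()
    HANDOFF_PENDING = auto()
    HUMAN_CONTROL = auto()
    SHUTDOWN = auto()

class AutopilotController:
    """Toy handoff protocol: shutdown is only allowed once the human has control."""
    def __init__(self):
        self.mode = Mode.AI_CONTROL

    def request_shutdown(self):
        # Never power off directly from AI control; first offer control back.
        if self.mode == Mode.AI_CONTROL:
            self.mode = Mode.HANDOFF_PENDING

    def human_confirms_control(self):
        if self.mode == Mode.HANDOFF_PENDING:
            self.mode = Mode.HUMAN_CONTROL

    def complete_shutdown(self):
        if self.mode == Mode.HUMAN_CONTROL:
            self.mode = Mode.SHUTDOWN
        return self.mode

c = AutopilotController()
c.request_shutdown()
print(c.complete_shutdown())   # still HANDOFF_PENDING: the human has not confirmed
c.human_confirms_control()
print(c.complete_shutdown())   # now SHUTDOWN
```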
Direction 10: Corrigibility
A corrigible AI "cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut down or modify their procedures".
Direction 11: Deference to humans
The aim of this research direction is to design AI systems which defer to humans in certain situations, even when the AI believes it has a better understanding of the correct action or of what humans will later prefer. There are also times when an AI system really does know better than a human, and should not defer to a single human about an important decision.
Direction 12: Generative models of open-source equilibria
This research direction explores the game-theoretic implications of players in a game being able to inspect the internal 'thought processes' of the other players. For example, humans may have the ability to inspect the inner workings of an AI, and this ability may change how the AI acts.
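A minimal game-theoretic sketch of why inspectability changes behaviour, using an invented one-shot trust game: when the AI's policy is opaque, the human anticipates defection and refuses to delegate; when the policy can be inspected before the human moves (so it acts as a commitment), the AI does better by committing to cooperate. All payoffs are illustrative only.

```python
# Payoffs are (human, AI). The human moves first: delegate or not.
# If the human delegates, the AI either cooperates or defects.
PAYOFF = {
    ("no_delegate", None): (0, 0),
    ("delegate", "cooperate"): (2, 2),
    ("delegate", "defect"): (-1, 3),
}

def outcome(human_action, ai_policy):
    ai_action = ai_policy if human_action == "delegate" else None
    return PAYOFF[(human_action, ai_action)]

# Opaque AI: the human anticipates the AI's best reply after delegation
# (defect, since 3 > 2), so the human refuses to delegate.
best_ai_reply = max(["cooperate", "defect"], key=lambda a: PAYOFF[("delegate", a)][1])
opaque_choice = max(["delegate", "no_delegate"], key=lambda h: outcome(h, best_ai_reply)[0])
print("opaque AI:", opaque_choice, outcome(opaque_choice, best_ai_reply))

# Open-source AI: the human inspects the committed policy before acting, so the
# AI picks the visible policy whose induced human response it prefers.
def human_best_response(ai_policy):
    return max(["delegate", "no_delegate"], key=lambda h: outcome(h, ai_policy)[0])

committed = max(["cooperate", "defect"],
                key=lambda p: outcome(human_best_response(p), p)[1])
h = human_best_response(committed)
print("open-source AI:", committed, h, outcome(h, committed))
```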
Single/Multi Delegation Research
Single human stakeholder delegating to multiple separated AI systems.
Single/Multi Comprehension
Direction 13: Rigorous coordination models
Can we develop an analogue of the von Neumann-Morgenstern (VNM) theorem for a multi-agent system with a single goal? Such a theorem should account for communication between the agents, and for constraints on the objective functions of the individual agents. The human stakeholder could be included as one of the agents, rather than just considering the AI systems.
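For reference, the single-agent VNM theorem says that any preference ordering over lotteries satisfying completeness, transitivity, continuity, and independence can be represented by expected utility maximization:

$$ L_1 \succeq L_2 \iff \mathbb{E}_{L_1}[u] \ge \mathbb{E}_{L_2}[u] $$

for some utility function u, unique up to positive affine transformation. A multi-agent analogue would presumably characterize when a collection of communicating agents behaves, in aggregate, as if it were maximizing a single expected utility.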
Direction 14: Interpretable machine language
Similarly to how we want to understand the inner workings of single AI systems, it would also be useful to understand the communications between separate AI systems.
Direction 15: Relationship taxonomy and detection
In multi-agent systems, different kinds of relationships such as 'cooperative' or 'competitive' can often emerge. It would be good to classify these and other types of relationships, and to have methods for detecting them.
Direction 16: Interpretable hierarchical reporting
For large problems, AI systems may be arranged into hierarchies, and the problems may be split into smaller parts. AI systems in the middle of the hierarchy could provide human readable reports, and ask questions to humans and other AI systems in the hierarchy.
Single/Multi Instruction
Direction 17: Hierarchical human-in-the-loop learning (HHL)
In a system with multiple AI subsystems it may be useful for subsystems to query humans or other subsystems when they don't have enough information. HHL is about setting up a hierarchy of AI systems which can do this, and is compatible with human oversight at each level.
Direction 18: Purpose inheritance
An AI system has a heritable purpose if it creates descendants whose actions also serve the original system's purpose. Descendants may be created in various ways: the AI may have an internal planning process which selects actions to create a descendant; the AI may be instructed (by a human or another AI) to create descendants; or a descendant may arise accidentally for other reasons. Even if a system is not misaligned, it is important to ensure that its descendants are also not misaligned.
Direction 19: Human-compatible ethics learning
Human-favorable norms may be derivable from more fundamental ethical abstractions such as loyalty or reciprocity. These could be used to apply constraints to AI systems, even if the constraints violate VNM rationality.
Direction 20: Self-indication uncertainty
It will be possible to copy and reimplement AI systems in different environments. Powerful AI systems should perhaps have some awareness of this, and for example know that they may be being used by hackers in a deliberately corrupted environment. Additionally, if a system thinks there is some chance it is in a test environment, it may be more likely to comply with shutdown commands.
Single/Multi Control
Here multiple research directions from Single/Single Control can be generalized to the case with multiple AI systems: Generalizable shutdown and handoff methods, Corrigibility, Deference to humans. However, there will likely be novel failure modes which arise from having multiple AI systems, even if the single/single case is safe.
Unsafe composition of local shutdown protocols
Even if the shutdown protocol is safe for a single agent, it may not be safe when multiple agents are shut off. As an example, a single doctor can take the day off, but not all the doctors at once.
Malfunctioning of local shutdown protocols
If some/most of the agents in a system successfully shut down, any remaining agents may be unsafe outside of the context of the original system. This seems especially true when the agents which shut down were responsible for safety or stability of the system.
Multi-Stakeholder Objectives
Before moving on to the research directions concerned with multiple stakeholders, this section of the report discusses what it means for an AI system to be aligned or misaligned when there are multiple stakeholders. When stakeholders hold different views, it is hard to define what it would mean for a system to be 'aligned'. However, from the perspective of humanity as a whole, human extinction is clearly a bad outcome; therefore, from here onward it is easier to define misalignment rather than alignment in the multi-stakeholder scenario.
The report lists 4 multi-stakeholder objectives which serve as high level guidelines for the multi/single and multi/multi delegation research.
Objective 1: Facilitating collaborative governance
This objective is to make it easier for diverse stakeholders to collaborate in the oversight of AI technologies, by developing AI technology alongside governance techniques which capture the benefits of collaboration while avoiding forms of collaboration which would be unsafe or unnecessarily costly compared to independent governance. Institutions may be motivated to collaborate to ensure their views are represented, to pool knowledge and resources, and to reduce the chances of existential risk.
Objective 2: Avoiding races by sharing control
This objective is to make collaborative oversight of AI systems sufficiently easy and appealing as to reduce first mover advantages which could lead to unsafe development races.
Objective 3: Reducing idiosyncratic risk taking
This objective aims to co-develop AI and governance techniques which enable governing stakeholders to collaborate on mitigating the idiosyncratic tendencies of each stakeholder, while still enabling the group to take broadly agreeable risks.
Objective 4: Existential safety systems
This objective is to develop existential safety systems: somewhat autonomous organizational units (comprising humans and/or automated tools) whose purpose is to protect global public safety at the scale of existential risk.
Multi/Single Delegation Research
Multiple human stakeholders (e.g. different companies or members of a committee) delegating to a single AI system.
Multi/Single Comprehension
Many of the single/single comprehension strategies will be useful for the multi/single case; additionally, there is at least one research direction specific to dealing with multiple stakeholders.
Direction 21: Privacy for operating committees
A committee charged with operating/overseeing a powerful AI system should be able to discuss the system without being observed by the system. For example, if members of the committee are worried an AI system is misaligned, they may be worried about discussing it because the AI system may find out and take actions to avoid being shut down.
Multi/Single Instruction
Direction 22: Modeling human committee deliberation
A system whose purpose is to serve a committee may need to model not only the preferences of the individual committee members, but also the dynamics of the deliberation process. In some sense this is diametrically opposed to Direction 21 (privacy for operating committees), which highlights how difficult it can be to ensure research is beneficial, and hence how careful we must be.
Direction 23: Moderating human belief disagreements
Developing a service for assisting in the resolution of belief disagreements between conflicting stakeholders. If stakeholders are more able to reach agreement about their beliefs, then this seems beneficial for all of the multi-stakeholder objectives.
Direction 24: Resolving planning disagreements
There may be cases in which belief or value differences between stakeholders can't be reconciled, in which case it would be useful to be able to formulate plans which the multiple stakeholders can still agree on. This again seems beneficial for all of the multi-stakeholder objectives.
Multi/Single Control
As in the single/single control case, the overseers of an AI system should retain the ability to shut down or override the system. Different committee structures could be used for this, for example: only shut down if there is a consensus, or allow every member the unilateral authority to shut down.
Direction 25: Shareable execution control
This research direction explores ways in which multiple stakeholders each have some control over an AI system's execution. A basic example of this would be a system where all committee members need to enter their passwords before execution.
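Here is a toy sketch of the password idea generalized to a k-of-n threshold: the system executes a command only if enough committee members supply valid approval tokens for that exact command. The committee names and HMAC scheme are illustrative only; a real shareable execution control mechanism would need far more (threshold cryptography, secure hardware, and so on).

```python
import hmac, hashlib

# Toy k-of-n execution control: run only if at least `threshold` committee
# members present a valid approval token for this exact command.
SECRETS = {"alice": b"a-secret", "bob": b"b-secret", "carol": b"c-secret"}

def approval_token(member: str, command: str) -> str:
    return hmac.new(SECRETS[member], command.encode(), hashlib.sha256).hexdigest()

def authorize(command: str, tokens: dict, threshold: int = 2) -> bool:
    valid = sum(
        hmac.compare_digest(tok, approval_token(member, command))
        for member, tok in tokens.items() if member in SECRETS
    )
    return valid >= threshold

cmd = "deploy_model_v2"
tokens = {"alice": approval_token("alice", cmd), "bob": approval_token("bob", cmd)}
print(authorize(cmd, tokens))                       # True: two valid approvals
print(authorize(cmd, {"alice": tokens["alice"]}))   # False: only one approval
```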
Multi/Multi Delegation Research
Multiple human stakeholders delegating to multiple different AI systems.
Multi/Multi Comprehension
Direction 26: Capacity oversight criteria
This research direction is concerned with the identification of criteria describing what capacities might be needed for any given research group to develop prepotent AI. Such criteria could be used to define auditing requirements which governments or regulators could use to ensure development teams were acting safely and ethically.
Multi/Multi Instruction
Direction 27: Social contract learning
This research direction is focused on enabling AI systems to respect a 'social contract' with multiple stakeholders, including existing institutions such as states, businesses, or even all of human civilization. Even if a system is technically acting within the law, there may be unethical actions which a social contract could prevent.
Multi/Multi Control
Direction 28: Reimplementation security
This direction is concerned with individuals modifying or reimplementing AI systems in a multi/multi scenario, in cases where these changes could break the safety of the overall interacting system. Without proper security, unsafe reimplementation by careless or malicious actors could lead to any of the Tier 1 risks.
Direction 29: Human-compatible equilibria
This final research direction is concerned with developing a more realistic understanding of the game-theoretic equilibria where some agents are humans. We want to guarantee that the human agents will not be destroyed or replaced.
Common Side Effects
Most, if not all, of the research directions have a section on side effects which could result from the research. These are often very specific to the research, but there are also some common themes.
The most common potential side effect is the research resulting in a false sense of security. Even if a problem is solved for a weak system or a single system, the problem may not be solved if the system becomes prepotent or in a scenario where there are multiple AI systems. For directions concerned with proofs and formal verification, a result may hold for strict assumptions but not work when more realistic assumptions are used.
A few of the research directions are concerned with allowing the AI to accurately model human decision making and cognition. These directions all share the side effect that they may make it easier for an AI system to manipulate humans. Additionally, if an AI can gain a lot of information just from observing a human's actions, it may learn rapidly enough that this contributes to prepotence.
The comprehension research directions often involve the AI system(s) producing a human-readable report about their internal workings. For these directions to be useful, the reports must be an accurate reflection of the system's decision-making process, rather than just a post-hoc rationalization of its actions. Also, the humans reading the reports have to believe them, and must be comfortable and able to talk about the risks from AI systems. Therefore, if discourse around AI existential safety is impaired, these comprehension techniques become less useful.
Additional Thoughts
In this report there seem to be two main types of research directions about developing AI: directions aimed at directly building safe AI, and directions aimed at building AI tools which will help us safely develop and use powerful AI. For example, most of the single/single research directions (e.g. transparency and explainability, calibrated confidence reports, preference learning) seem applicable to building safe AI, while towards the end of the report (especially in the multiple-stakeholder scenarios) there are more directions related to building AI systems which will hopefully lead to humans making good decisions (e.g. modeling human committee deliberation, moderating human belief disagreements). Both types are about developing AI, but it seems useful to consider whether the aim of a research direction is to directly make a safe AI, or to make sure human stakeholders make good decisions around the use of AI. There are also research directions which fit partially into both of these categories, or into neither.
In a similar way, as the report progresses from the single/single research directions to multi/multi, the directions move from concrete ideas rooted in computer science towards broader questions. These broader questions cover areas related to decision making and group dynamics which are also relevant to fields outside of AI safety. This is probably to be expected: almost all AI safety work so far has focused on single/single delegation scenarios, so the current research there digs deeper into specific questions. The more complicated multi-agent scenarios have received less attention, so the research is at an earlier stage and there are fewer specific questions to focus on.