Some AI research areas and their relevance to existential safety

Andrew_Critch

Followed by: What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs), which provides examples of multi-stakeholder/multi-agent interactions leading to extinction events.

Introduction

This post is an overview of a variety of AI research areas in terms of how much I think contributing to and/or learning from those areas might help reduce AI x-risk. By research areas I mean “AI research topics that already have groups of people working on them and writing up their results”, as opposed to research “directions” in which I’d like to see these areas “move”.

I formed these views mostly pursuant to writing AI Research Considerations for Human Existential Safety (ARCHES). My hope is that my assessments in this post can be helpful to students and established AI researchers who are thinking about shifting into new research areas specifically with the goal of contributing to existential safety somehow. In these assessments, I find it important to distinguish between the following types of value:

The helpfulness of the area to existential safety, which I think of as a function of what services are likely to be provided as a result of research contributions to the area, and whether those services will be helpful to existential safety, versus
The educational value of the area for thinking about existential safety, which I think of as a function of how much a researcher motivated by existential safety might become more effective through the process of becoming familiar with or contributing to that area, usually by focusing on ways the area could be used in service of existential safety.
The neglect of the area at various times, which is a function of how much technical progress has been made in the area relative to how much I think is needed.

Importantly:

The helpfulness to existential safety scores do not assume that your contributions to this area would be used only for projects with existential safety as their mission. This can negatively impact the helpfulness of contributing to areas that are more likely to be used in ways that harm existential safety.
The educational value scores are not about the value of an existential-safety-motivated researcher teaching about the topic, but rather, learning about the topic.
The neglect scores are not measuring whether there is enough “buzz” around the topic, but rather, whether there has been adequate technical progress in it. Buzz can predict future technical progress, though, by causing people to work on it.

Below is a table of all the areas I considered for this post, along with their entirely subjective “scores” I’ve given them. The rest of this post can be viewed simply as an elaboration/explanation of this table:

Existing Research Area	Social Application	Helpfulness to Existential Safety	Educational Value	2015 Neglect	2020 Neglect	2030 Neglect
Out of Distribution Robustness	Zero/Single	1/10	4/10	5/10	3/10	1/10
Agent Foundations	Zero/Single	3/10	8/10	9/10	8/10	7/10
Multi-agent RL	Zero/Multi	2/10	6/10	5/10	4/10	0/10
Preference Learning	Single/Single	1/10	4/10	5/10	1/10	0/10
Side-effect Minimization	Single/Single	4/10	4/10	6/10	5/10	4/10
Human-Robot Interaction	Single/Single	6/10	7/10	5/10	4/10	3/10
Interpretability in ML	Single/Single	8/10	6/10	8/10	6/10	2/10
Fairness in ML	Multi/Single	6/10	5/10	7/10	3/10	2/10
Computational Social Choice	Multi/Single	7/10	7/10	7/10	5/10	4/10
Accountability in ML	Multi/Multi	8/10	3/10	8/10	7/10	5/10

The research areas are ordered from least-socially-complex to most-socially-complex. This roughly (though imperfectly) correlates with addressing existential safety problems of increasing importance and neglect, according to me. Correspondingly, the second column categorizes each area according to the simplest human/AI social structure it applies to:

Zero/Single: Zero-human / Single-AI scenarios
Zero/Multi: Zero-human / Multi-AI scenarios
Single/Single: Single-human / Single-AI scenarios
Single/Multi: Single-human / Multi-AI scenarios
Multi/Single: Multi-human / Single-AI scenarios
Multi/Multi: Multi-human / Multi-AI scenarios

Epistemic status & caveats

I developed the views in this post mostly over the course of the two years I spent writing and thinking about AI Research Considerations for Human Existential Safety (ARCHES). I make the following caveats:

These views are my own, and while others may share them, I do not intend to speak in this post for any institution or group of which I am part.
I am not an expert in Science, Technology, and Society (STS). Historically there hasn’t been much focus on existential risk within STS, which is why I’m not citing much in the way of sources from STS. However, from its name, STS as a discipline ought to be thinking a lot about AI x-risk. I think there’s a reasonable chance of improvement on this axis over the next 2-3 years, but we’ll see.
I made this post with essentially zero deference to the judgement of other researchers. This is academically unusual, and prone to more variance in what ends up being expressed. It might even be considered rude. Nonetheless, I thought it might be valuable or at least interesting to stimulate conversation on this topic that is less filtered through patterns of deference to others. My hope is that people can become less inhibited in discussing these topics if my writing isn’t too “polished”. I might also write a more deferent and polished version of this post someday, especially if nice debates arise from this one that I want to distill into a follow-up post.

Defining our objectives

In this post, I’m going to talk about AI existential safety as distinct from both AI alignment and AI safety as technical objectives. A number of blogs seem to treat these terms as near-synonyms (e.g., LessWrong, the Alignment Forum), and I think that is a mistake, at least when it comes to guiding technical work for existential safety. First I’ll define these terms, and then I’ll elaborate on why I think it’s important not to conflate them.

AI existential safety (definition)

In this post, AI existential safety means “preventing AI technology from posing risks to humanity that are comparable to or greater than human extinction in terms of their moral significance.”

This is a bit more general than the definition in ARCHES. I believe this definition is fairly consistent with Bostrom’s usage of the term “existential risk”, and will have reasonable staying power as the term “AI existential safety” becomes more popular, because it directly addresses the question “What does this term have to do with existence?”.

AI safety (definition)

AI safety generally means getting AI systems to avoid risks, of which existential safety is an extreme special case with unique challenges. This usage is consistent with normal everyday usage of the term “safety” (dictionary.com/browse/safety), and will have reasonable staying power as the term “AI safety” becomes (even) more popular. AI safety includes safety for self-driving cars as well as for superintelligences, including issues that these topics do and do not share in common.

AI ethics (definition)

AI ethics generally refers to principles that AI developers and systems should follow. The “should” here creates a space for debate, whereby many people and institutions can try to impose their values on what principles become accepted. Often this means AI ethics discussions become debates about edge cases that people disagree about instead of collaborations on what they agree about. On the other hand, if there is a principle that all or most debates about AI ethics would agree on or take as a premise, that principle becomes somewhat easier to enforce.

AI governance (definition)

AI governance generally refers to identifying and enforcing norms for AI developers and AI systems themselves to follow. The question of which principles should be enforced often opens up debates about safety and ethics. Governance debates are a bit more action-oriented than purely ethical debates, such that more effort is focussed on enforcing agreeable norms relative to debating about disagreeable norms. Thus, AI governance, as an area of human discourse, is engaged with the problem of aligning the development and deployment of AI technologies with broadly agreeable human values. Whether AI governance is engaged with this problem well or poorly is, of course, a matter of debate.

AI alignment (definition)

AI alignment usually means “Getting an AI system to {try | succeed} to do what a human person or institution wants it to do”. The inclusion of “try” or “succeed” respectively creates a distinction between intent alignment and impact alignment. This usage is consistent with normal everyday usage of the term “alignment” (dictionary.com/browse/alignment) as used to refer to alignment of values between agents, and is therefore relatively unlikely to undergo definition-drift as the term “AI alignment” becomes more popular. For instance,

(2002) “Alignment” was used this way in 2002 by Daniel Shapiro and Ross Shachter, in their AAAI conference paper User/Agent Value Alignment, the first paper to introduce the concept of alignment into AI research. This work was not motivated by existential safety as far as I know, and is not cited in any of the more recent literature on “AI alignment” motivated by existential safety, though I think it got off to a reasonably good start in defining user/agent value alignment.
(2014) “Alignment” was used this way in the technical problems described by Nate Soares and Benya Fallenstein in Aligning Superintelligence with Human Interests: A Technical Research Agenda. While the authors’ motivation is clearly to serve the interests of all humanity, the technical problems outlined are all about impact alignment in my opinion, with the possible exception of what they call “Vingean Reflection” (which is necessary for a subagent of society thinking about society).
(2018) “Alignment” is used this way by Paul Christiano in his post Clarifying AI Alignment, which is focussed on intent alignment.

A broader meaning of “AI alignment” that is not used here

There is another, different usage of “AI alignment”, which refers to ensuring that AI technology is used and developed in ways that are broadly aligned with human values. I think this is an important objective that is deserving of a name to call more technical attention to it, and perhaps this is the spirit in which the “AI alignment forum” is so-titled. However, the term “AI alignment” already has poor staying-power for referring to this objective in technical discourse outside of a relatively cloistered community, for two reasons:

As described above, “alignment” already has a relatively clear technical meaning that AI researchers have gravitated towards, one that is also consistent with the natural-language meaning of the term, and
AI governance, at least in democratic states, is basically already about this broader problem. If one wishes to talk about AI governance that is beneficial to most or all humans, “humanitarian AI governance” is much clearer and more likely to stick than “AI alignment”.

Perhaps “global alignment”, “civilizational alignment”, or “universal AI alignment” would make sense to distinguish this concept from the narrower meaning that alignment usually takes on in technical settings. In any case, for the duration of this post, I am using “alignment” to refer to its narrower, technically prevalent meaning.

Distinguishing our objectives

As promised, I will now elaborate on why it’s important not to conflate the objectives above. Some people might feel that these arguments are about how important these concepts are, but I’m mainly trying to argue about how importantly different they are. By analogy: while knives and forks are both important tools for dining, they are not usable interchangeably.

Safety vs existential safety (distinction)

“Safety” is not robustly usable as a synonym for “existential safety”. It is true that AI existential safety is literally a special case of AI safety, for the simple reason that avoiding existential risk is a special case of avoiding risk. And, it may seem useful for coalition-building purposes to unite people under the phrase “AI safety” as a broadly agreeable objective. However, I think we should avoid declaring to ourselves or others that “AI safety” will or should always be interpreted as meaning “AI existential safety”, for several reasons:

Using these terms as synonyms will have very little staying power as AI safety research becomes (even) more popular.
AI existential safety is deserving of direct attention that is not filtered through a lens of discourse that confuses it with self-driving car safety.
AI safety in general is deserving of attention as a broadly agreeable principle around which people can form alliances and share ideas.

Alignment vs existential safety (distinction)

Some people tend to use these terms as near-synonyms. However, I think this usage has some important problems:

Using “alignment” and “existential safety” as synonyms will have poor staying-power as the term “AI alignment” becomes more popular. Conflating them will offend both the people who want to talk about existential safety (because they think it is more important and “obviously what we should be talking about”) as well as the people who want to talk about AI alignment (because they think it is more important and “obviously what we should be talking about”).
AI alignment refers to a cluster of technically well-defined problems that are important to work on for numerous reasons, and deserving of a name that does not secretly mean “preventing human extinction” or similar.
AI existential safety (I claim) also refers to a technically well-definable problem that is important to work on, and deserving of a name that does not secretly mean “getting systems to do what the user is asking”.
AI alignment is not trivially helpful to existential safety, and efforts to make it helpful require a certain amount of societal-scale steering to guide them. If we treat these terms as synonyms, we impoverish our collective awareness of ways in which AI alignment solutions could pose novel problems for existential safety.

This last point gets its own section.

AI alignment is inadequate for AI existential safety

Around 50% of my motivation for writing this post is my concern that progress in AI alignment, which is usually focused on “single/single” interactions (i.e., alignment for a single human stakeholder and a single AI system), is inadequate for ensuring existential safety for advancing AI technologies. Indeed, among problems I can currently see in the world that I might have some ability to influence, addressing this issue is currently one of my top priorities.

The reason for my concern here is pretty simple to state, via the following two diagrams:

Of course, understanding and designing useful and modular single/single interactions is a good first step toward understanding multi/multi interactions, and many people (including myself) who think about AI alignment are thinking about it as a stepping stone to understanding the broader societal-scale objective of ensuring existential safety.

However, this pattern mirrors the one AI capabilities research was following before safety, ethics, and alignment began surging in popularity. Consider that most AI (construed to include ML) researchers are developing AI capabilities as stepping stones toward understanding and deploying those capabilities in safe and value-aligned applications for human users. Despite this, over the past decade there has been a growing sense among AI researchers that capabilities research has not been sufficiently forward-looking in terms of anticipating its role in society, including the need for safety, ethics, and alignment work. This general concern can be seen emanating not only from AGI-safety-oriented groups like those at DeepMind, OpenAI, MIRI, and in academia, but also from AI-ethics-oriented groups, such as the ACM Future of Computing Academy:

https://acm-fca.org/2018/03/29/negativeimpacts/

Just as folks interested in AI safety and ethics needed to start thinking beyond capabilities, folks interested in AI existential safety need to start thinking beyond alignment. The next section describes what I think this means for technical work.

Anticipating, legitimizing and fulfilling governance demands

The main way I can see present-day technical research benefitting existential safety is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise over the next 10-30 years. In short, there often needs to be some amount of traction on a technical area before it’s politically viable for governing bodies to demand that institutions apply and improve upon solutions in those areas. Here’s what I mean in more detail:

By governance demands, I’m referring to social and political pressures to ensure AI technologies will produce or avoid certain societal-scale effects. Governance demands include pressures like “AI technology should be fair”, “AI technology should not degrade civic integrity”, or “AI technology should not lead to human extinction.” For instance, Twitter’s recent public decision to maintain a civic integrity policy can be viewed as a response to governance demand from its own employees and surrounding civic society.

Governance demand is distinct from consumer demand, and it yields a different kind of transaction when the demand is met. In particular, when a tech company fulfills a governance demand, the company legitimizes that demand by providing evidence that it is possible to fulfill. This might require the company to break ranks with other technology companies who deny that the demand is technologically achievable.

By legitimizing governance demands, I mean making it easier to establish common knowledge that a governance demand is likely to become a legal or professional standard. But how can technical research legitimize demands from a non-technical audience?

The answer is to genuinely demonstrate in advance that the governance demands are feasible to meet. Passing a given professional standard or legislation usually requires the demands in it to be “reasonable” in terms of appearing to be technologically achievable. Thus, computer scientists can help legitimize a governance demand by anticipating the demand in advance, and beginning to publish solutions for it. My position here is not that the solutions should be exaggerated in their completeness, even if that will increase ‘legitimacy’; I argue only that we should focus energy on finding solutions that, if communicated broadly and truthfully, will genuinely raise confidence that important governance demands are feasible. (Without this ethic against exaggeration, common knowledge of the legitimacy of legitimacy itself is degraded, which is bad, so we shouldn’t exaggerate.)

This kind of work can make a big difference to the future. If the algorithmic techniques needed to meet a given governance demand are 10 years of research away from discovery (as opposed to just 1 year), then it’s easier for large companies to intentionally or inadvertently maintain a narrative that the demand is unfulfillable and therefore illegitimate. Conversely, if the algorithmic techniques to fulfill the demand already exist, it’s a bit harder (though still possible) to deny the legitimacy of the demand. Thus, CS researchers can legitimize certain demands in advance, by beginning to prepare solutions for them.

I think this is the most important kind of work a computer scientist can do in service of existential safety. For instance, I view ML fairness and interpretability research as responding to existing governance demand, which (genuinely) legitimizes the cause of AI governance itself, which is hugely important. Furthermore, I view computational social choice research as addressing an upcoming governance demand, which is even more important.

My hope in writing this post is that some of the readers here will start trying to anticipate AI governance demands that will arise over the next 10-30 years. In doing so, we can begin to think about technical problems and solutions that could genuinely legitimize and fulfill those demands when they arise, with a focus on demands whose fulfillment can help stabilize society in ways that mitigate existential risks.

Research Areas

Alright, let’s talk about some research!

Out of distribution robustness (OODR)

Existing Research Area	Social Application	Helpfulness to Existential Safety	Educational Value	2015 Neglect	2020 Neglect	2030 Neglect
Out of Distribution Robustness	Zero/Single	1/10	4/10	5/10	3/10	1/10

This area of research is concerned with avoiding risks that arise from systems interacting with contexts and environments that are changing significantly over time, such as from training time to testing time, from testing time to deployment time, or from controlled deployments to uncontrolled deployments.

OODR (un)helpfulness to existential safety:

Contributions to OODR research are not particularly helpful to existential safety in my opinion, for a combination of two reasons:

Progress in OODR will mostly be used to help roll out more AI technologies into active deployment more quickly, and
Research in this area usually does not involve deep or lengthy reflections about the structure of society and human values and interactions, which I think makes this field sort of collectively blind to the consequences of the technologies it will help build.

I think this area would be more helpful if it were more attentive to the structure of the multi-agent context that AI systems will be in. Professor Tom Dietterich has made some attempts to shift thinking on robustness to be more attentive to the structure of robust human institutions, which I think is a good step:

Robust artificial intelligence and robust human organizations (2018) Dietterich, Thomas G.

Unfortunately, the above paper has only 8 citations at the time of writing (very little for AI/ML), and there does not seem to be much else in the way of publications that address societal-scale or even institutional-scale robustness.

OODR educational value:

Studying and contributing to OODR research is of moderate educational value for people thinking about x-risk, in my opinion. Speaking for myself, it helps me think about how society as a whole is receiving a changing distribution of inputs from its environment (which society itself is creating). As human society changes, the inputs to AI technologies will change, and we want the existence of human society to be robust to those changes. I don’t think most researchers in this area think about it in that way, but that doesn’t mean you can’t.

OODR neglect:

Robustness to changing environments has never been a particularly neglected concept in the history of automation, and it is not likely to ever become neglected, because myopic commercial incentives push so strongly in favor of progress on it. Specifically, robustness of AI systems is essential for tech companies to be able to roll out AI-based products and services, so there is no lack of incentive for the tech industry to work on robustness. In reinforcement learning specifically, robustness has been somewhat neglected, although less so now than in 2015, partly thanks to AI safety (broadly construed) taking off. I think by 2030 this area will be even less neglected, even in RL.

OODR exemplars:

Recent exemplars of high value to existential safety, according to me:

(2018) Robust artificial intelligence and robust human organizations, Dietterich, Thomas G*
*The above paper is not really about out of distribution robustness, but among papers I’ve found appreciably valuable to x-safety, it’s the closest.

Recent exemplars of high educational value, according to me:

(2016) Doubly robust off-policy value evaluation for reinforcement learning, Jiang, Nan; Li, Lihong.*
*Not directly about distributional shift, but valuable to this area in my opinion.
(2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks, Hendrycks, Dan; Gimpel, Kevin.
(2017) Enhancing the reliability of out-of-distribution image detection in neural networks, Liang, Shiyu; Li, Yixuan; Srikant, R.
(2017) Training confidence-calibrated classifiers for detecting out-of-distribution samples, Lee, Kimin; Lee, Honglak; Lee, Kibok; Shin, Jinwoo.
(2018) Learning confidence for out-of-distribution detection in neural networks, DeVries, Terrance; Taylor, Graham W.
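For readers who have not seen the detection papers listed above, here is a minimal sketch of the simplest idea among them, the maximum-softmax-probability baseline associated with Hendrycks and Gimpel (2016): treat the classifier's largest softmax probability as a confidence score, and flag inputs whose score is low. The logits and the 0.7 threshold below are made-up illustrative assumptions, not values from any of the papers, and a real system would choose the threshold on held-out data.

```python
import math
from typing import Sequence, List


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax_score(logits: Sequence[float]) -> float:
    """Confidence score: the largest softmax probability."""
    return max(softmax(logits))


def looks_out_of_distribution(logits: Sequence[float], threshold: float = 0.7) -> bool:
    """Flag an input as possibly out-of-distribution when confidence is low.

    The threshold is a hypothetical value chosen only for this illustration.
    """
    return max_softmax_score(logits) < threshold


# Hypothetical logits: one peaked prediction and one nearly flat prediction.
confident_logits = [8.0, 0.5, 0.1]
flat_logits = [1.0, 1.1, 0.9]

print(looks_out_of_distribution(confident_logits))
print(looks_out_of_distribution(flat_logits))
```

The peaked prediction passes the check and the nearly flat one is flagged. Note that this is exactly the kind of single-system tool that says nothing about the societal-scale robustness questions raised above.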

Agent foundations (AF)

Existing Research Area	Social Application	Helpfulness to Existential Safety	Educational Value	2015 Neglect	2020 Neglect	2030 Neglect
Agent Foundations	Zero/Single	3/10	8/10	9/10	8/10	7/10

This area is concerned with developing and investigating fundamental definitions and theorems pertaining to the concept of agency. This often includes work in areas such as decision theory, game theory, and bounded rationality. I’m going to write more for this section because I know more about it and think it’s pretty important to “get right”.

AF (un)helpfulness to existential safety:

Contributions to agent foundations research are key to the foundations of AI safety and ethics, but are also potentially misusable. Thus, arbitrary contributions to this area are not necessarily helpful, while targeted contributions aimed at addressing real-world ethical problems could be extremely helpful. Here is why I believe this:

I view agent foundations work as looking very closely at the fundamental building blocks of society, i.e., agents and their decisions. It’s important to understand agents and their basic operations well, because we’re probably going to produce (or allow) a very large number of them to exist/occur. For instance, imagine any of the following AI-related operations happening at least 1,000,000 times (a modest number given the current world population):

A human being delegates a task to an AI system to perform, thereby ceding some control over the world to the AI system.
An AI system makes a decision that might yield important consequences for society, and acts on it.
A company deploys an AI system into a new context where it might have important side effects.
An AI system builds or upgrades another AI system (possibly itself) and deploys it.
An AI system interacts with another AI system, possibly yielding externalities for society.
An hour passes where AI technology is exerting more control over the state of the Earth than humans are.

Suppose there's some class of negative outcomes (e.g. human extinction) that we want to never occur as a result of any of these operations. In order to be just 55% sure that all of these 1,000,000 operations will be safe (i.e., avoid the negative outcome class), on average (on a log scale) we need to be at least 99.99994% sure that each instance of the operation is safe (i.e., will not precipitate the negative outcome). Similarly, for any accumulable quantity of “societal destruction” (such as risk, pollution, or resource exhaustion), in order to be sure that these operations will not yield “100 units” of societal destruction, we need each operation on average to produce at most “0.0001 units” of destruction.*

(*Would-be-footnote: Incidentally, the main reason I think OODR research is educationally valuable is that it can eventually help with applying agent foundations research to societal-scale safety. Specifically: how can we know if one of the operations (a)-(f) above is safe to perform 1,000,000 times, given that it was safe the first 1,000 times we applied it in a controlled setting, but the setting is changing over time? This is a special case of an OODR question.)
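The per-operation figures above are easy to check numerically. The sketch below is only an illustration of the arithmetic: it assumes, for simplicity, that the operations are independent, so that the overall safety probability is the product of the per-operation probabilities, and the function names are my own.

```python
import math


def required_per_operation_confidence(overall_confidence: float, n_operations: int) -> float:
    """Geometric-mean per-operation safety probability p with p ** n == overall.

    Assumes (for illustration only) that the operations are independent.
    """
    # Work in log space: p = exp(ln(overall) / n).
    return math.exp(math.log(overall_confidence) / n_operations)


def per_operation_budget(total_budget: float, n_operations: int) -> float:
    """Average amount of an accumulable harm that each operation may produce."""
    return total_budget / n_operations


N = 1_000_000  # number of operations considered in the text

p = required_per_operation_confidence(0.55, N)
print(f"required per-operation confidence: {p:.5%}")
print(f"per-operation budget for 100 units: {per_operation_budget(100, N)}")
```

The first line printed rounds to the 99.99994% quoted above. The "on a log scale" qualifier corresponds to taking a geometric rather than arithmetic mean of the per-operation probabilities, while the destruction budget is an ordinary arithmetic average.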

Unfortunately, understanding the building blocks of society can also allow the creation of potent societal forces that would harm society. For instance, understanding human decision-making extremely well might help advertising companies to control public opinion to an unreasonable degree (which arguably has already happened, even with today’s rudimentary agent models), or it might enable the construction of a super-decision-making system that is misaligned with human existence.

That said, I don’t think this means you have to be super careful about information security around agent foundations work, because in general it’s not easy to communicate fundamental theoretical results in research, let alone by accident.

Rather, my recommendation for maximizing the positive value of work in this area is to apply the insights you get from it to areas that make it easier to represent societal-scale moral values in AI. E.g., I think applications of agent foundations results to interpretability, fairness, computational social choice, and accountability are probably net good, whereas applications to speed up arbitrary ML capabilities are not obviously good.

AF educational value:

Studying and contributing to agent foundations research has the highest educational value for thinking about x-risk among the research areas listed here, in my opinion. The reason is that agent foundations research does the best job of questioning potentially faulty assumptions underpinning our approach to existential safety. In particular, I think our understanding of how to safely integrate AI capabilities with society is increasingly contingent on our understanding of agent foundations work as defining the building blocks of society.

AF neglect:

This area is extremely neglected in my opinion. I think around 50% of the progress in this area, worldwide, happens at MIRI, which has a relatively small staff of agent foundations researchers. While MIRI has grown over the past 5 years, agent foundations work in academia hasn’t grown much, and I don’t expect it to grow much by default (though perhaps posts like this might change that default).

AF exemplars:

Below are recent exemplars of agent foundations work that I think is of relatively high value to existential safety, mostly via their educational value to understanding the foundations of how agents work ("agent foundations"). The work is mostly from three main clusters: MIRI, Vincent Conitzer's group at Duke, and Joe Halpern's group at Cornell.

(2015) Translucent players: Explaining cooperative behavior in social dilemmas, Capraro, Valerio; Halpern, Joseph Y.
(2016) Logical induction, Garrabrant, Scott; Benson-Tilsen, Tsvi; Critch, Andrew; Soares, Nate; Taylor, Jessica. *
*COI note: I am a coauthor on the above paper. If many other people were writing existential safety appraisals like this post, I’d omit my own papers from this list and defer to others to judge them.
(2016) Reflective oracles: A foundation for game theory in artificial intelligence, Fallenstein, Benja; Taylor, Jessica; Christiano, Paul F.
(2017) Functional decision theory: A new theory of instrumental rationality, Yudkowsky, Eliezer; Soares, Nate.
(2017) Disarmament games, Deng, Yuan; Conitzer, Vincent.
(2018) Game theory with translucent players, Halpern, Joseph Y; Pass, Rafael.
(2019) Embedded agency, Demski, Abram; Garrabrant, Scott.
(2019) A parametric, resource-bounded generalization of Löb’s theorem, and a robust cooperation criterion for open-source game theory, Critch, Andrew.*
*COI note: I am the author of the above paper. If many other people were writing existential safety appraisals like this post, I’d omit my own papers from this list and defer to others to judge them.
(2019) Risks from Learned Optimization in Advanced Machine Learning Systems, Hubinger, Evan; van Merwijk, Chris; Mikulik, Vladimir; Skalse, Joar; Garrabrant, Scott.

Multi-agent reinforcement learning (MARL)

Existing Research Area	Social Application	Helpfulness to Existential Safety	Educational Value	2015 Neglect	2020 Neglect	2030 Neglect
Multi-agent RL	Zero/Multi	2/10	6/10	5/10	4/10	0/10

MARL is concerned with training multiple agents to interact with each other and solve problems using reinforcement learning. There are a few varieties to be aware of:

Cooperative vs competitive vs adversarial tasks: do the agents all share a single objective, or have separate objectives that are imperfectly aligned, or have completely opposed (zero-sum) objectives?
Centralized training vs decentralized training: is there a centralized process that observes the agents and controls how they learn, or is there a separate (private) learning process for each agent?
Communicative vs non-communicative: is there a special channel the agents can use to generate observations for each other that are otherwise inconsequential, or are all observations generated in the course of consequential actions?
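To make the first distinction concrete, here is a minimal sketch (not taken from any of the papers listed below) that labels a two-player matrix game by how the agents' rewards relate. The function name `classify_game` and the three example games are illustrative assumptions; real MARL tasks are sequential and far richer than a single payoff table.

```python
from typing import Dict, Tuple

# Maps a joint action (action_1, action_2) to the pair (reward_1, reward_2).
Payoffs = Dict[Tuple[str, str], Tuple[float, float]]


def classify_game(payoffs: Payoffs) -> str:
    """Label a two-player game by how the two agents' rewards relate."""
    rewards = list(payoffs.values())
    # Identical rewards everywhere: the agents share a single objective.
    if all(r1 == r2 for r1, r2 in rewards):
        return "cooperative"
    # Constant total reward: one agent's gain is exactly the other's loss.
    total = rewards[0][0] + rewards[0][1]
    if all(r1 + r2 == total for r1, r2 in rewards):
        return "adversarial"
    # Otherwise the objectives are separate and imperfectly aligned.
    return "competitive"


# Hypothetical example games, chosen only for illustration.
coordination = {
    ("L", "L"): (1, 1), ("L", "R"): (0, 0),
    ("R", "L"): (0, 0), ("R", "R"): (1, 1),
}
matching_pennies = {
    ("H", "H"): (1, -1), ("H", "T"): (-1, 1),
    ("T", "H"): (-1, 1), ("T", "T"): (1, -1),
}
prisoners_dilemma = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

for name, game in [("coordination", coordination),
                   ("matching pennies", matching_pennies),
                   ("prisoner's dilemma", prisoners_dilemma)]:
    print(name, "->", classify_game(game))
```

The prisoner's dilemma lands in the "competitive" (mixed-motive) category: the agents are neither fully aligned nor fully opposed.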

I think the most interesting MARL research involves decentralized training for competitive objectives in communicative environments, because this set-up is the most representative of how AI systems from diverse human institutions are likely to interact.

MARL (un)helpfulness to existential safety:

Contributions to MARL research are mostly not very helpful to existential safety in my opinion, because MARL’s most likely use case will be to help companies to deploy fleets of rapidly interacting machines that might pose risks to human society. The MARL projects with the greatest potential to help are probably those that find ways to achieve cooperation between decentrally trained agents in a competitive task environment, because such cooperation could minimize destructive conflicts between fleets of AI systems that cause collateral damage to humanity. That said, even this area of research risks making it easier for fleets of machines to cooperate and/or collude to the exclusion of humans, increasing the risk of humans becoming gradually disenfranchised and perhaps replaced entirely by machines that are better and faster at cooperation than humans.

MARL educational value:

I think MARL has a high educational value, because it helps researchers to observe directly how difficult it is to get multi-agent systems to behave well. I think most of the existential risk from AI over the next decades and centuries comes from the incredible complexity of behaviors possible from multi-agent systems, and from underestimating that complexity before it takes hold in the real world and produces unexpected negative side effects for humanity.

MARL neglect:

MARL was somewhat neglected 5 years ago, but has picked up a lot. I suspect MARL will keep growing in popularity because of its value as a source of curricula for learning algorithms. I don’t think it is likely to become more civic-minded, unless arguments along the lines of this post lead to a shift of thinking in the field.

MARL exemplars:

Recent exemplars of high educational value, according to me:

(2015) Cooperating with unknown teammates in complex domains: A robot soccer case study of ad hoc teamwork, Barrett, Samuel; Stone, Peter.
(2016) Learning to communicate with deep multi-agent reinforcement learning, Foerster, Jakob; Assael, Ioannis Alexandros; de Freitas, Nando; Whiteson, Shimon.
(2017) Emergent complexity via multi-agent competition, Bansal, Trapit; Pachocki, Jakub; Sidor, Szymon; Sutskever, Ilya; Mordatch, Igor.
(2017) Making friends on the fly: Cooperating with new teammates, Barrett, Samuel; Rosenfeld, Avi; Kraus, Sarit; Stone, Peter.
(2017) Multi-agent actor-critic for mixed cooperative-competitive environments, Lowe, Ryan; Wu, Yi; Tamar, Aviv; Harb, Jean; Abbeel, Pieter; Mordatch, Igor.
(2017) Multiagent cooperation and competition with deep reinforcement learning, Tampuu, Ardi; Matiisen, Tambet; Kodelja, Dorian; Kuzovkin, Ilya; Korjus, Kristjan; Aru, Juhan; Aru, Jaan; Vicente, Raul.
(2017) Stabilising experience replay for deep multi-agent reinforcement learning, Foerster, Jakob; Nardelli, Nantas; Farquhar, Gregory; Afouras, Triantafyllos; Torr, Philip HS; Kohli, Pushmeet; Whiteson, Shimon.
(2017) Counterfactual multi-agent policy gradients, Foerster, Jakob; Farquhar, Gregory; Afouras, Triantafyllos; Nardelli, Nantas; Whiteson, Shimon.
(2017) Learning with opponent-learning awareness, Foerster, Jakob N; Chen, Richard Y; Al-Shedivat, Maruan; Whiteson, Shimon; Abbeel, Pieter; Mordatch, Igor.
(2018) Autonomous agents modelling other agents: A comprehensive survey and open problems, Albrecht, Stefano V; Stone, Peter.
(2018) Learning to share and hide intentions using information regularization, Strouse, DJ; Kleiman-Weiner, Max; Tenenbaum, Josh; Botvinick, Matt; Schwab, David J.
(2018) Inequity aversion improves cooperation in intertemporal social dilemmas, Hughes, Edward; Leibo, Joel Z; Phillips, Matthew; Tuyls, Karl; Duenez-Guzman, Edgar; Castaneda, Antonio Garcia; Dunning, Iain; Zhu, Tina; McKee, Kevin; Koster, Raphael; others.
(2019) Social influence as intrinsic motivation for multi-agent deep reinforcement learning, Jaques, Natasha; Lazaridou, Angeliki; Hughes, Edward; Gulcehre, Caglar; Ortega, Pedro; Strouse, DJ; Leibo, Joel Z; De Freitas, Nando.
(2019) Policy-gradient algorithms have no guarantees of convergence in continuous action and state multi-agent settings, Mazumdar, Eric; Ratliff, Lillian J; Jordan, Michael I; Sastry, S Shankar.

Preference learning (PL)

Existing Research Area	Social Application	Helpfulness to Existential Safety	Educational Value	2015 Neglect	2020 Neglect	2030 Neglect
Preference Learning	Single/Single	1/10	4/10	5/10	1/10	0/10

This area is concerned with learning about human preferences in a form usable for guiding the policies of artificial agents. In an RL (reinforcement learning) setting, preference learning is often called reward learning, because the learned preferences take the form of a reward function for training an RL system.
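As a rough illustration of what "learning a reward function from preferences" can mean, the sketch below fits a linear reward to pairwise comparisons using a Bradley-Terry (logistic) preference model, which is one common formulation in this literature. The feature names, the toy comparisons, and the helper names (`reward`, `prob_prefers`, `fit_reward`) are assumptions made for this example; they are not the setup of any specific paper listed below.

```python
import math
from typing import List, Sequence, Tuple

Features = Sequence[float]  # hypothetical features: (task_progress, mess_made)


def reward(weights: Sequence[float], x: Features) -> float:
    """Linear reward model: a weighted sum of outcome features."""
    return sum(w * xi for w, xi in zip(weights, x))


def prob_prefers(weights: Sequence[float], a: Features, b: Features) -> float:
    """Bradley-Terry probability that outcome a is preferred to outcome b."""
    return 1.0 / (1.0 + math.exp(reward(weights, b) - reward(weights, a)))


def fit_reward(comparisons: List[Tuple[Features, Features]], dim: int,
               steps: int = 500, lr: float = 0.1) -> List[float]:
    """Gradient ascent on the log-likelihood of (preferred, rejected) pairs."""
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for preferred, rejected in comparisons:
            p = prob_prefers(weights, preferred, rejected)
            for i in range(dim):
                # d/dw log sigmoid(r(preferred) - r(rejected))
                grad[i] += (1.0 - p) * (preferred[i] - rejected[i])
        weights = [w + lr * g / len(comparisons)
                   for w, g in zip(weights, grad)]
    return weights


# Toy data: the (hypothetical) human prefers more progress and less mess.
comparisons = [
    ((1.0, 0.0), (0.0, 0.0)),
    ((1.0, 0.0), (1.0, 1.0)),
    ((0.0, 0.0), (0.0, 1.0)),
    ((1.0, 1.0), (0.0, 1.0)),
]
learned = fit_reward(comparisons, dim=2)
print("learned weights:", learned)
```

The learned weights come out positive on progress and negative on mess; a reward function of this kind could then be handed to an RL algorithm as its training signal.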

PL (un)helpfulness to existential safety:

Contributions to preference learning are not particularly helpful to existential safety in my opinion, because their most likely use case is for modeling human consumers just well enough to create products they want to use and/or advertisements they want to click on. Such advancements will be helpful to rolling out usable tech products and platforms more quickly, but not particularly helpful to existential safety.*

Preference learning is of course helpful to AI alignment, i.e., the problem of getting an AI system to do something a human wants. Please refer back to the sections above on Defining our objectives and Distinguishing our objectives for an elaboration of how this is not the same as AI existential safety. In any case, I see AI alignment in turn as having two main potential applications to existential safety:

(1) AI alignment is useful as a metaphor for thinking about how to align the global effects of AI technology with human existence, a major concern for AI governance at a global scale, and
(2) AI alignment solutions could be used directly to govern powerful AI technologies designed specifically to make the world safer.

While many researchers interested in AI alignment are motivated by (1) or (2), I find these pathways of impact problematic. Specifically,

(1) elides the complexities of multi-agent interactions I think are likely to arise in most realistic futures, and I think the most difficult to resolve existential risks arise from those interactions.
(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.

Moreover, I believe contributions to AI alignment are also generally unhelpful to existential safety, for the same reasons as preference learning. Specifically, progress in AI alignment hastens the pace at which high-powered AI systems will be rolled out into active deployment, shortening society’s headway for establishing international treaties governing the use of AI technologies.

Thus, the existential safety value of AI alignment research in its current technical formulations—and preference learning as a subproblem of it—remains educational in my view.*

(*Would-be-footnote: I hope no one will be too offended by this view. I did have some trepidation about expressing it on the “alignment” forum, but I think I should voice these concerns anyway, for the following reason. In 2011 after some months of reflection on a presentation by Andrew Ng, I came to believe that deep learning was probably going to take off, and that, contrary to Ng’s opinion, this would trigger a need for a lot of AI alignment work in order to make the technology safe. This feeling of worry is what triggered me to cofound CFAR and start helping to build a community that thinks more critically about the future. I currently have a similar feeling of worry toward preference learning and AI alignment, i.e., that it is going to take off and trigger a need for a lot more “AI civility” work that seems redundant or “too soon to think about” for a lot of AI alignment researchers today, the same way that AI researchers said it was “too soon to think about” AI alignment. To the extent that I think I was right to be worried about AI progress kicking off in the decade following 2011, I think I’m right to be worried again now about preference learning and AI alignment (in its narrow and socially-simplistic technical formulations) taking off in the 2020s and 2030s.)

PL educational value:

Studying and making contributions to preference learning is of moderate educational value for thinking about existential safety in my opinion. The reason is this: if we want machines to respect human preferences—including our preference to continue existing—we may need powerful machine intelligences to understand our preferences in a form they can act on. Of course, being understood by a powerful machine is not necessarily a good thing. But if the machine is going to do good things for you, it will probably need to understand what “good for you” means. In other words, understanding preference learning can help with AI alignment research, which can help with existential safety. And if existential safety is your goal, you can try to target your use of preference learning concepts and methods toward that goal.

PL neglect:

Preference learning has always been crucial to the advertising industry, and as such it has not been neglected in recent years. For the same reason, it’s also not likely to become neglected. Its application to reinforcement learning is somewhat new, however, because until recently there was much less active research in reinforcement learning. In other words, recent interest in reward learning is mainly a function of increased interest in reinforcement learning, rather than increased interest in preference learning. If new learning paradigms supersede reinforcement learning, preference learning for those paradigms will not be far behind.

(This is not a popular opinion; I apologize if I have offended anyone who believes that progress in preference learning will reduce existential risk, and I certainly welcome debate on the topic.)

PL exemplars:

Recent works of significant educational value, according to me:

(2017) Deep reinforcement learning from human preferences, Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario.
(2018) Reward learning from human preferences and demonstrations in Atari, Ibarz, Borja; Leike, Jan; Pohlen, Tobias; Irving, Geoffrey; Legg, Shane; Amodei, Dario.
(2018) The alignment problem for Bayesian history-based reinforcement learners, Everitt, Tom; Hutter, Marcus.
(2019) Learning human objectives by evaluating hypothetical behavior, Reddy, Siddharth; Dragan, Anca D; Levine, Sergey; Legg, Shane; Leike, Jan.
(2019) On the feasibility of learning, rather than assuming, human biases for reward inference, Shah, Rohin; Gundotra, Noah; Abbeel, Pieter; Dragan, Anca D.
(2020) Reward-rational (implicit) choice: A unifying formalism for reward learning, Jeon, Hong Jun; Milli, Smitha; Dragan, Anca D.

Human-robot interaction (HRI)

Existing Research Area	Social Application	Helpfulness to Existential Safety	Educational Value	2015 Neglect	2020 Neglect	2030 Neglect
Human-Robot Interaction	Single/Single	6/10	7/10	5/10	4/10	3/10

HRI research is concerned with designing and optimizing patterns of interaction between humans and machines—usually actual physical robots, but not always.

HRI helpfulness to existential safety:

On net, I think AI/ML would be better for the world if most of its researchers pivoted from general AI/ML into HRI, simply because it would force more AI/ML researchers to more frequently think about real-life humans and their desires, values, and vulnerabilities. Moreover, I think it reasonable (as in, >1% likely) that such a pivot might actually happen if, say, 100 more researchers make this their goal.

For this reason, I think contributions to this area today are pretty solidly good for existential safety, although not perfectly so: HRI research can also be used to deceive humans, which can degrade societal-scale honesty norms, and I’ve seen HRI research targeting precisely that. However, my model of readers of this blog is that they’d be unlikely to contribute to those parts of HRI research, such that I feel pretty solidly about recommending contributions to HRI.

HRI educational value:

I think HRI work is of unusually high educational value for thinking about existential safety, even among other topics in this post. The reason is that, by working with robots, HRI work is forced to grapple with high-dimensional and continuous state spaces and action spaces that are too complex for the human subjects involved to consciously model. This, to me, crucially mirrors the relationship between future AI technology and human society: humanity, collectively, will likely be unable to consciously grasp the full breadth of states and actions that our AI technologies are transforming and undertaking for us. I think many AI researchers outside of robotics are mostly blind to this difficulty, which on its own is an argument in favor of more AI researchers working in robotics. The beauty of HRI is that it also explicitly and continually thinks about real human beings, which I think is an important mental skill to practice if you want to protect humanity collectively from existential disasters.

HRI neglect:

A neglect score for this area was uniquely difficult for me to specify. On one hand, HRI is a relatively established and vibrant area of research compared with some of the more nascent areas covered in this post. On the other hand, as mentioned, I’d eventually like to see the entirety of AI/ML as a field pivoting toward HRI work, which means it is still very neglected compared to where I want to see it. Furthermore, I think such a pivot is actually reasonable to achieve over the next 20-30 years. Further still, I think industrial incentives might eventually support this pivot, perhaps on a similar timescale.

So: if the main reason you care about neglect is that you are looking to produce a strong founder effect, you should probably discount my numerical neglect scores for this area, given that it’s not particularly “small” on an absolute scale compared to the other areas here. By that metric, I’d have given something more like {2015:4/10; 2020:3/10; 2030:2/10}. On the other hand, if you’re an AI/ML researcher looking to “do the right thing” by switching to an area that pretty much everyone should switch into, you definitely have my “doing the right thing” assessment if you switch into this area, which is why I’ve given it somewhat higher neglect scores.

HRI exemplars:

(2015) Shared autonomy via hindsight optimization, Javdani, Shervin; Srinivasa, Siddhartha S; Bagnell, J Andrew.
(2015) Learning preferences for manipulation tasks from online coactive feedback, Jain, Ashesh; Sharma, Shikhar; Joachims, Thorsten; Saxena, Ashutosh.
(2016) Cooperative inverse reinforcement learning, Hadfield-Menell, Dylan; Russell, Stuart J; Abbeel, Pieter; Dragan, Anca.
(2017) Planning for autonomous cars that leverage effects on human actions, Sadigh, Dorsa; Sastry, Shankar; Seshia, Sanjit A; Dragan, Anca D.
(2017) Should robots be obedient?, Milli, Smitha; Hadfield-Menell, Dylan; Dragan, Anca; Russell, Stuart.
(2019) Where do you think you're going?: Inferring beliefs about dynamics from behavior, Reddy, Sid; Dragan, Anca; Levine, Sergey.
(2019) Literal or Pedagogic Human? Analyzing Human Model Misspecification in Objective Learning, Milli, Smitha; Dragan, Anca D.
(2019) Hierarchical game-theoretic planning for autonomous vehicles, Fisac, Jaime F; Bronstein, Eli; Stefansson, Elis; Sadigh, Dorsa; Sastry, S Shankar; Dragan, Anca D.
(2020) Pragmatic-pedagogic value alignment, Fisac, Jaime F; Gates, Monica A; Hamrick, Jessica B; Liu, Chang; Hadfield-Menell, Dylan; Palaniappan, Malayandi; Malik, Dhruv; Sastry, S Shankar; Griffiths, Thomas L; Dragan, Anca D.

Side-effect minimization (SEM)

Existing Research Area	Social Application	Helpfulness to Existential Safety	Educational Value	2015 Neglect	2020 Neglect	2030 Neglect
Side-effect Minimization	Single/Single	4/10	4/10	6/10	5/10	4/10

SEM research is concerned with developing domain-general methods for making AI systems less likely to produce side effects, especially negative side effects, in the course of pursuing an objective or task.
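A deliberately naive sketch of the general shape of such methods: subtract from the task reward a penalty proportional to how far the world has drifted from some baseline state. The state features, the `deviation` measure, and the penalty weight here are all assumptions for illustration; the methods in the exemplars below use considerably more careful notions of impact (for example, reachability or attainable utility) than a raw feature count.

```python
from typing import Dict

# Hypothetical snapshot of features of the world that are NOT part of the task.
State = Dict[str, int]


def deviation(state: State, baseline: State) -> int:
    """Count the features that differ from the baseline state."""
    keys = set(state) | set(baseline)
    return sum(1 for k in keys if state.get(k) != baseline.get(k))


def penalized_reward(task_reward: float, state: State, baseline: State,
                     penalty_weight: float = 1.0) -> float:
    """Task reward minus a penalty for changing the world beyond the task."""
    return task_reward - penalty_weight * deviation(state, baseline)


baseline = {"vase_intact": 1, "door_open": 0}
careful = {"vase_intact": 1, "door_open": 0}   # task done, nothing else changed
careless = {"vase_intact": 0, "door_open": 0}  # task done, vase broken

print(penalized_reward(10.0, careful, baseline, penalty_weight=2.0))
print(penalized_reward(10.0, careless, baseline, penalty_weight=2.0))
```

Both trajectories earn the same task reward, but the careless one scores lower once the side effect is priced in. Choosing the baseline and the deviation measure is where most of the research difficulty lies.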

SEM helpfulness to existential safety:

I think this area has two obvious applications to safety-in-general:

(“accidents”) preventing an AI agent from “messing up” when performing a task for its primary stakeholder(s), and
(“externalities”) preventing an AI system from generating problems for persons other than its primary stakeholders, either
1. (“unilateral externalities”) when the system generates externalities through its unilateral actions, or
2. (“multilateral externalities”) when the externalities are generated through the interaction of an AI system with another entity, such as a non-stakeholder or another AI system.

I think the application to externalities is more important and valuable than the application to accidents, because I think externalities are (even) harder to detect and avoid than accidents. Moreover, I think multilateral externalities are (even!) harder to avoid than unilateral externalities.

Currently, SEM research is focussed mostly on accidents, which is why I’ve only given it a moderate score on the helpfulness scale. Conceptually, it does make sense to focus on accidents first, then unilateral externalities, and then multilateral externalities, because of the increasing difficulty in addressing them.

However, the need to address multilateral externalities will arise very quickly after unilateral externalities are addressed well enough to roll out legally admissible products, because most of our legal systems have an easier time defining and punishing negative outcomes that have a responsible party. I don’t believe this is a quirk of human legal systems: when two imperfectly aligned agents interact, they complexify each other’s environment in a way that consumes more cognitive resources than interacting with a non-agentic environment. (This is why MARL and self-play are seen as powerful curricula for learning.) Thus, there is less cognitive “slack” to think about non-stakeholders in a multi-agent setting than in a single-agent setting.

For this reason, I think work that makes it easy for AI systems and their designers to achieve common knowledge around how the systems should avoid producing externalities is very valuable.

SEM educational value:

I think SEM research thus far is of moderate educational value, mainly just to kickstart your thinking about side effects.

SEM neglect:

Domain-general side-effect minimization for AI is a relatively new area of research, and is still somewhat neglected. Moreover, I suspect it will remain neglected, because of the aforementioned tendency for our legal system to pay too little attention to multilateral externalities, a key source of negative side effects for society.

SEM exemplars:

Recent exemplars of value to existential safety, mostly via starting to think about the generalized concept of side effects at all:

(2018) Penalizing side effects using stepwise relative reachability, Krakovna, Victoria; Orseau, Laurent; Kumar, Ramana; Martic, Miljan; Legg, Shane.
(2019) Safelife 1.0: Exploring side effects in complex environments, Wainwright, Carroll L; Eckersley, Peter.
(2019) Preferences Implicit in the State of the World, Shah, Rohin; Krasheninnikov, Dmitrii; Alexander, Jordan; Abbeel, Pieter; Dragan, Anca.
(This paper is about preference inference, but I think it applies more specifically to inferring how not to have negative side effects.)
(2020) Conservative agency via attainable utility preservation, Turner, Alexander Matt; Hadfield-Menell, Dylan; Tadepalli, Prasad.

Interpretability in ML (IntML)

Existing Research Area	Social Application	Helpfulness to Existential Safety	Educational Value	2015 Neglect	2020 Neglect	2030 Neglect
Interpretability in ML	Single/Single	8/10	6/10	8/10	6/10	2/10

Interpretability research is concerned with making the reasoning and decisions of AI systems more interpretable to humans. Interpretability is closely related to transparency and explainability. Not all authors treat these three concepts as distinct; however, I think when a useful distinction is drawn between them, it often looks something like this:

a system is “transparent” if it is easy for human users or developers to observe and track important parameters of its internal state;
a system is “explainable” if useful explanations of its reasoning can be produced after the fact; and
a system is “interpretable” if its reasoning is structured in a manner that does not require additional engineering work to produce accurate human-legible explanations.

In other words, interpretable systems are systems with the property that transparency is adequate for explainability: when we look inside them, we find they are structured in a manner that does not require much additional explanation. I see Professor Cynthia Rudin as the primary advocate for this distinguished notion of interpretability, and I find it to be an important concept to distinguish.
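A toy sketch of a model that is interpretable in this stronger sense: an ordered rule list, where the rule that fires is itself the explanation and no separate explanation machinery is needed. The rules, thresholds, and feature names below are invented purely for illustration; they are not medical guidance and are not taken from the rule-list papers cited in the exemplars.

```python
from typing import Callable, Dict, List, Tuple

Patient = Dict[str, float]
# Each rule is (human-readable description, condition, predicted label).
Rule = Tuple[str, Callable[[Patient], bool], str]

# Hypothetical rules, checked in order; the first match decides the output.
RULES: List[Rule] = [
    ("age > 75 and prior_stroke",
     lambda p: p["age"] > 75 and p["prior_stroke"] == 1,
     "high risk"),
    ("hypertension",
     lambda p: p["hypertension"] == 1,
     "medium risk"),
]
DEFAULT_LABEL = "low risk"


def predict_with_explanation(patient: Patient) -> Tuple[str, str]:
    """Return (label, explanation); the explanation is just the fired rule."""
    for description, condition, label in RULES:
        if condition(patient):
            return label, f"because {description}"
    return DEFAULT_LABEL, "because no rule matched"


print(predict_with_explanation(
    {"age": 80, "prior_stroke": 1, "hypertension": 0}))
print(predict_with_explanation(
    {"age": 50, "prior_stroke": 0, "hypertension": 0}))
```

Here, looking inside the model (transparency) is already enough to explain any individual decision, which is the property the paragraph above describes.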

IntML helpfulness to existential safety:

I think interpretability research contributes to existential safety in a fairly direct way on the margin today. Specifically, progress in interpretability will

decrease the degree to which human AI developers will end up misjudging the properties of the systems they build,
increase the degree to which systems and their designers can be held accountable for the principles those systems embody, perhaps even before those principles have a chance to manifest in significant negative societal-scale consequences, and
potentially increase the degree to which competing institutions and nations can establish cooperation and international treaties governing AI-heavy operations.

I believe this last point may turn out to be the most important application of interpretability work. Specifically, I think institutions that use a lot of AI technology (including but not limited to powerful autonomous AI systems) could become opaque to one another in a manner that hinders cooperation between and governance of those systems. By contrast, a degree of transparency between entities can facilitate cooperative behavior, a phenomenon which has been borne out in some of the agent foundations work listed above, specifically:

(2015) Translucent players: Explaining cooperative behavior in social dilemmas, Capraro, Valerio; Halpern, Joseph Y.
(2018) Game theory with translucent players, Halpern, Joseph Y; Pass, Rafael.
(2019) A parametric, resource-bounded generalization of Löb’s theorem, and a robust cooperation criterion for open-source game theory, Critch, Andrew.

In other words, I think interpretability research can enable technologies that legitimize and fulfill AI governance demands, narrowing the gap between what policy makers will wish for and what technologists will agree is possible.
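The way mutual transparency can support cooperation is easiest to see in a very crude version of an open-source game, sketched below. Each agent's policy can read a public description of the other agent's policy before acting. The `Agent` class, the source strings, and the "mirror" policy are assumptions for this example; the cited papers study far more robust criteria than exact source matching, and the sketch simply assumes the published description is honest.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Agent:
    name: str
    source: str  # public description of the policy (assumed honest)
    decide: Callable[[str, str], str]  # (own_source, opp_source) -> "C"/"D"


def play(a: Agent, b: Agent) -> Tuple[str, str]:
    """One-shot game in which each agent sees the other's published source."""
    return a.decide(a.source, b.source), b.decide(b.source, a.source)


def mirror_policy(own_source: str, opponent_source: str) -> str:
    # Cooperate only with agents whose published policy matches our own.
    return "C" if opponent_source == own_source else "D"


def always_defect(own_source: str, opponent_source: str) -> str:
    return "D"


MIRROR_SOURCE = "cooperate iff opponent source equals mine"
mirror_1 = Agent("mirror_1", MIRROR_SOURCE, mirror_policy)
mirror_2 = Agent("mirror_2", MIRROR_SOURCE, mirror_policy)
defector = Agent("defector", "always defect", always_defect)

print(play(mirror_1, mirror_2))  # both can verify each other and cooperate
print(play(mirror_1, defector))  # transparency exposes the defector
```

Without access to each other's source, the two mirror agents would have no basis for distinguishing a fellow cooperator from a defector.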

IntML educational value:

I think interpretability research is of moderately high educational value for thinking about existential safety, because some research in this area is somewhat surprising in terms of showing ways to maintain interpretability without sacrificing much in the way of performance. This can change our expectations about how society can and should be structured to maintain existential safety, by changing the degree of interpretability we can and should expect from AI-heavy institutions and systems.

IntML neglect:

I think IntML is fairly neglected today relative to its value. However, over the coming decade, I think there will be opportunities for companies to speed up their development workflows by improving the interpretability of systems to their developers. In fact, I think for many companies interpretability is going to be a crucial bottleneck for advancing their product development. These developments won’t be my favorite applications of interpretability, and I might eventually become less excited about contributions to interpretability if all of the work seems oriented on commercial or militarized objectives instead of civic responsibilities. But in any case, I think getting involved with interpretability research today is a pretty robustly safe and valuable career move for any up and coming AI researchers, especially if they do their work with an eye toward existential safety.

IntML exemplars:

Recent exemplars of high value to existential safety:

(2015) Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model, Letham, Benjamin; Rudin, Cynthia; McCormick, Tyler H; Madigan, David; others.
(2017) Towards a rigorous science of interpretable machine learning, Doshi-Velez, Finale; Kim, Been.
(2018) The mythos of model interpretability, Lipton, Zachary C.
(2018) The building blocks of interpretability, Olah, Chris; Satyanarayan, Arvind; Johnson, Ian; Carter, Shan; Schubert, Ludwig; Ye, Katherine; Mordvintsev, Alexander.
(2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Rudin, Cynthia.
(2019) This looks like that: deep learning for interpretable image recognition, Chen, Chaofan; Li, Oscar; Tao, Daniel; Barnett, Alina; Rudin, Cynthia; Su, Jonathan K.
(2019) A study in Rashomon curves and volumes: A new perspective on generalization and model simplicity in machine learning, Semenova, Lesia; Rudin, Cynthia; Parr, Ronald.

Fairness in ML (FairML)

Existing Research Area	Social Application	Helpfulness to Existential Safety	Educational Value	2015 Neglect	2020 Neglect	2030 Neglect
Fairness in ML	Multi/Single	6/10	5/10	7/10	3/10	2/10

Fairness research in machine learning is typically concerned with altering or constraining learning systems to make sure their decisions are “fair” according to a variety of definitions of fairness.
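To illustrate why “a variety of definitions” matters, the sketch below computes two widely used group-fairness measures on a tiny made-up dataset: the demographic parity gap (difference in positive prediction rates) and the equal opportunity gap (difference in true positive rates). The records and function names are assumptions for this example only.

```python
from typing import List, Tuple

Record = Tuple[str, int, int]  # (group, true_label, predicted_label)


def positive_rate(records: List[Record], group: str) -> float:
    """Fraction of the group that receives a positive prediction."""
    preds = [pred for g, _, pred in records if g == group]
    return sum(preds) / len(preds)


def true_positive_rate(records: List[Record], group: str) -> float:
    """Fraction of truly positive group members predicted positive."""
    preds = [pred for g, label, pred in records if g == group and label == 1]
    return sum(preds) / len(preds)


def demographic_parity_gap(records: List[Record], a: str, b: str) -> float:
    return abs(positive_rate(records, a) - positive_rate(records, b))


def equal_opportunity_gap(records: List[Record], a: str, b: str) -> float:
    return abs(true_positive_rate(records, a) - true_positive_rate(records, b))


# Made-up predictions for two hypothetical groups, "A" and "B".
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 0, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 0), ("B", 0, 1), ("B", 0, 0),
]

print("demographic parity gap:", demographic_parity_gap(records, "A", "B"))
print("equal opportunity gap:", equal_opportunity_gap(records, "A", "B"))
```

On this data the classifier satisfies demographic parity exactly (both groups receive positive predictions half the time) while treating qualified members of the two groups very differently, so the two definitions disagree about whether it is “fair”.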

FairML helpfulness to existential safety:

My hope for FairML as a field contributing to existential safety is fourfold:

(societal-scale thinking) Fairness comprises one or more human values that exist in service of society as a whole, and which are currently difficult to encode algorithmically, especially in a form that will garner unchallenged consensus. Getting more researchers to think in the framing “How do I encode a value that will serve society as a whole in a broadly agreeable way” is good for big-picture thinking and hence for society-scale safety problems.
(social context awareness) FairML gets researchers to “take off their blinders” to the complexity of society surrounding them and their inventions. I think this trend is gradually giving AI/ML researchers a greater sense of social and civic responsibility, which I think reduces existential risk from AI/ML.
(sensitivity to unfair uses of power) Simply put, it’s unfair to place all of humanity at risk without giving all of humanity a chance to weigh in on that risk. More focus within CS on fairness as a human value could help alleviate this risk. Specifically, fairness debates often trigger redistributions of resources in a more equitable manner, thus working against the over-centralization of power within a given group. I have some hope that fairness considerations will work against the premature deployment of powerful AI/ML systems that would lead to the hyper-centralization of power over the world (and hence would pose acute global risks by being a single point of failure).
(Fulfilling and legitimizing governance demands) Fairness research can be used to fulfill and legitimize AI governance demands, narrowing the gap between what policy makers wish for and what technologists agree is possible. This process makes AI as a field more amenable to governance, thereby improving existential safety.

FairML educational value:

I think FairML research is of moderate educational value for thinking about existential safety, mainly via the opportunities it creates for thinking about the points in the section on helpfulness above. If the field were more mature, I would assign it a higher educational value.

I should also flag that most work in FairML has not been done with existential safety in mind. Thus, I’m very much hoping that more people who care about existential safety will learn about FairML and begin thinking about how principles of fairness can be leveraged to ensure societal-scale safety in the not-too-distant future.

FairML neglect:

FairML is not a particularly neglected area at the moment because there is a lot of excitement about it, and I think it will continue to grow. However, it was relatively neglected 5 years ago, so there is still a lot of room for new ideas in the space. Also, as mentioned, thinking in FairML is not particularly oriented toward existential safety, so I think research on fairness in service of societal-scale safety is quite neglected.

FairML exemplars:

Recent exemplars of high value to existential safety, mostly via attention to the problem of difficult-to-codify societal-scale values:

(2017) Inherent trade-offs in the fair determination of risk scores, Kleinberg, Jon; Mullainathan, Sendhil; Raghavan, Manish.
(2017) On fairness and calibration, Pleiss, Geoff; Raghavan, Manish; Wu, Felix; Kleinberg, Jon; Weinberger, Kilian Q.
(2018) Fairness and accountability design needs for algorithmic support in high-stakes public sector decision-making, Veale, Michael; Van Kleek, Max; Binns, Reuben.
(2018) Delayed impact of fair machine learning, Liu, Lydia T; Dean, Sarah; Rolf, Esther; Simchowitz, Max; Hardt, Moritz.
(2018) Fairness definitions explained, Verma, Sahil; Rubin, Julia.
(2019) Fairness and abstraction in sociotechnical systems, Selbst, Andrew D; Boyd, Danah; Friedler, Sorelle A; Venkatasubramanian, Suresh; Vertesi, Janet.

Computational Social Choice (CSC)

Existing Research Area	Social Application	Helpfulness to Existential Safety	Educational Value	2015 Neglect	2020 Neglect	2030 Neglect
Computational Social Choice	Multi/Single	7/10	7/10	7/10	5/10	4/10

Computational social choice research is concerned with using algorithms to model and implement group-level decisions using individual-scale information and behavior as inputs. I view CSC as a natural next step in the evolution of social choice theory that is more attentive to the implementation details of both agents and their environments. In my conception, CSC comprises subservient topics in mechanism design and algorithmic game theory, even if researchers in those areas don’t consider themselves to be working in computational social choice.
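As a small example of turning individual-scale inputs into a group-level decision, the sketch below implements two classic voting rules, plurality and Borda count, over ranked ballots. The ballots and the alphabetical tie-breaking are assumptions for illustration; actual CSC research also deals with strategic behavior, computational complexity, and much richer input formats.

```python
from collections import Counter
from typing import List, Sequence

Ballot = Sequence[str]  # candidates ranked from most to least preferred


def plurality_winner(ballots: List[Ballot]) -> str:
    """Candidate with the most first-place votes (ties broken by name)."""
    counts = Counter(ballot[0] for ballot in ballots)
    return max(sorted(counts), key=lambda c: counts[c])


def borda_winner(ballots: List[Ballot]) -> str:
    """Candidate with the highest Borda score (ties broken by name)."""
    scores: Counter = Counter()
    for ballot in ballots:
        n = len(ballot)
        for position, candidate in enumerate(ballot):
            # Top rank earns n - 1 points, last rank earns 0.
            scores[candidate] += n - 1 - position
    return max(sorted(scores), key=lambda c: scores[c])


# Hypothetical electorate of nine voters.
ballots = (
    [["A", "B", "C"]] * 4
    + [["B", "C", "A"]] * 3
    + [["C", "B", "A"]] * 2
)

print("plurality:", plurality_winner(ballots))
print("borda:", borda_winner(ballots))
```

The same ballots elect A under plurality and B under Borda, a reminder that the choice of aggregation mechanism is itself a value-laden design decision.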

CSC helpfulness to existential safety:

In short, computational social choice research will be necessary to legitimize and fulfill governance demands for technology companies (automated and human-run companies alike) to ensure AI technologies are beneficial to and controllable by human society. The process of succeeding or failing to legitimize such demands will lead to improving and refining what I like to call the algorithmic social contract: whatever broadly agreeable set of principles (if any) algorithms are expected to obey in relation to human society.

In 2018, I considered writing an article drawing more attention to the importance of developing an algorithmic social contract, but found this point had already been made quite eloquently by Iyad Rahwan in the following paper, which I highly recommend:

(2018) Society-in-the-loop: programming the algorithmic social contract, Rahwan, Iyad

Computational social choice methods in their current form are certainly far from providing adequate and complete formulations of an algorithmic social contract. See the following article for arguments against tunnel-vision on computational social choice as a complete solution to societal-scale AI ethics:

(2020) Social choice ethics in artificial intelligence, Baum, Seth D

Notwithstanding this concern, what follows is a somewhat detailed forecast of how I think computational social choice research will still have a crucial role to play in developing the algorithmic social contract throughout the development of individually-alignable transformative AI technologies, which I’ll call “the alignment revolution”.

First, once technology companies begin to develop individually-alignable transformative AI capabilities, there will be strong economic, social, and political pressures for their developers to sell those capabilities rather than hoard them. Specifically:

(economic pressure) Selling capabilities immediately garners resources in the form of money and information from the purchasers and users of the capabilities;
(social pressure) Hoarding capabilities could be seen as anti-social relative to distributing them more broadly through sales or free services;
(sociopolitical pressure) Selling capabilities allows society to become aware that those capabilities exist, enabling a smoother transition to embracing those capabilities. This creates a broadly agreeable concrete moral argument against capability hoarding, which could become politically relevant.
(political pressure) Political elites will be happier if technical elites “share” their capabilities with the rest of the economy rather than hoarding them.

Second, for the above reasons, I expect individually-alignable transformative AI capabilities to be distributed fairly broadly once they exist, creating an “alignment revolution” arising from those capabilities. (It’s possible I’m wrong about this, and for that reason I also welcome research on how to align non-distributed alignment capabilities; that’s just not where most of my chips lie, and not where the rest of this argument will focus.)

Third, unless humanity collectively works very hard to maintain a degree of simplicity and legibility in the overall structure of society*, this “alignment revolution” will greatly complexify our environment to a point of much greater incomprehensibility and illegibility than even today’s world. This, in turn, will impoverish humanity’s collective ability to keep abreast of important international developments, as well as our ability to hold the international economy accountable for maintaining our happiness and existence.

(*Would-be-footnote: I have some reasons to believe that perhaps we can and should work harder to make the global structure of society more legible and accountable to human wellbeing, but that is a topic for another article.)

Fourth, in such a world, algorithms will be needed to hold the aggregate global behavior of algorithms accountable to human wellbeing, because things will be happening too quickly for humans to monitor. In short, an “algorithmic government” will be needed to govern “algorithmic society”. Some might argue this is not strictly necessary: in the absence of a mathematically codified algorithmic social contract, humans could in principle coordinate to cease or slow down the use of these powerful new alignment technologies, in order to give ourselves more time to adjust to and govern their use. However, for all our successes in innovating laws and governments, I do not believe current human legal norms are quite developed enough to stably manage a global economy empowered with individually-alignable transformative AI capabilities.

Fifth, I do think our current global legal norms are much better than what many computer scientists naively proffer as replacements for them. My hope is that more resources and influence will slowly flow toward the areas of computer science most in touch with the nuances and complexities of codifying important societal-scale values. In my opinion, this work is mostly concentrated in and around computational social choice, to some extent mechanism design, and morally adjacent yet conceptually nascent areas of ML research such as fairness and interpretability.

While there is currently an increasing flurry of (well-deserved) activity in fairness and interpretability research, computational social choice is somewhat more mature, and has a lot to teach these younger fields. This is why I think CSC work is crucial to existential safety: it is the area of computer science most tailored to evoke reflection on the global structure of society, and the most mature in doing so.

So what does all this have to do with existential safety? Unfortunately, while CSC is significantly more mature as a field than interpretable ML or fair ML, it is still far from ready to fulfill governance demand at the ever-increasing speed and scale needed to ensure existential safety in the wake of individually-alignable transformative AI technologies. Moreover, I think punting these questions to future AI systems to solve for us is a terrible idea, because doing so impoverishes our ability to sanity-check whether those AI systems are giving us reasonable answers to our questions about social choice. So, on the margin I think contributions to CSC theory are highly valuable, especially by persons thinking about existential safety as the objective of their research.

CSC educational value:

Learning about CSC is necessary for contributions to CSC, which I think are currently needed to ensure existentially safe societal-scale norms for aligned AI systems to follow after “the alignment revolution” if it happens. So, I think CSC is highly valuable to learn about, with the caveat that most work in CSC has not been done with existential safety in mind. Thus, I’m very much hoping that more people who care about existential safety will learn about and begin contributing to CSC in ways that steer CSC toward issues of societal-scale safety.

CSC neglect:

As mentioned above, I think CSC is still far from ready to fulfill governance demands at the ever-increasing speed and scale that will be needed to ensure existential safety in the wake of “the alignment revolution”. That said, I do think over the next 10 years CSC will become both more imminently necessary and more popular, as more pressure falls upon technology companies to make societal-scale decisions. CSC will become still more necessary and popular as more humans and human institutions become augmented with powerful aligned AI capabilities that might “change the game” that our civilization is playing. I expect such advancements to raise increasingly deep and urgent questions about the principles on which our civilization is built, that will need technical answers in order to be fully resolved in ways that maintain existential safety.

CSC exemplars:

CSC exemplars of particular value and relevance to existential safety, mostly via their attention to formalisms for how to structure societal-scale decisions:

(2014) Dynamic social choice with evolving preferences, Parkes, David C; Procaccia, Ariel D.
(2016) Handbook of computational social choice, Brandt, Felix; Conitzer, Vincent; Endriss, Ulle; Lang, Jerome; Procaccia, Ariel D.
(2016) The revelation principle for mechanism design with reporting costs, Kephart, Andrew; Conitzer, Vincent.
(2016) Barriers to Manipulation in Voting, Conitzer, Vincent; Walsh, Toby.
(2016) Proportional justified representation, Sanchez-Fernandez, Luis; Elkind, Edith; Lackner, Martin; Fernandez, Norberto; Fisteus, Jesus A; Val, Pablo Basanta; Skowron, Piotr.
(2017) Fair public decision making, Conitzer, Vincent; Freeman, Rupert; Shah, Nisarg.
(2017) Fair social choice in dynamic settings, Freeman, Rupert; Zahedi, Seyed Majid; Conitzer, Vincent.
(2017) Justified representation in approval-based committee voting, Aziz, Haris; Brill, Markus; Conitzer, Vincent; Elkind, Edith; Freeman, Rupert; Walsh, Toby.
(2020) Preference elicitation for participatory budgeting, Benade, Gerdus; Nath, Swaprava; Procaccia, Ariel D; Shah, Nisarg.
(2020) Almost envy-freeness with general valuations, Plaut, Benjamin; Roughgarden, Tim.

Accountability in ML (AccML)

Existing Research Area	Social Application	Helpfulness to Existential Safety	Educational Value	2015 Neglect	2020 Neglect	2030 Neglect
Accountability in ML	Multi/Multi	8/10	3/10	8/10	7/10	5/10

Accountability in ML (AccML) is aimed at making it easier to hold persons or institutions accountable for the effects of ML systems. Accountability depends on transparency and explainability for evaluating the principles by which a harm or mistake occurs, but it is not subsumed by these objectives.

AccML helpfulness to existential safety:

The relevance of accountability to existential safety is mainly via the principle of accountability gaining more traction in governing the technology industry. In summary, the high level points I believe in this area are the following, which are argued for in more detail after the list:

1. Tech companies are currently “black boxes” to outside society, in that they can develop and implement (almost) whatever they want within the confines of privately owned laboratories (and other “secure” systems), and some of the things they develop or implement in private settings could pose significant harms to society.
2. Soon (or already), society needs to become less permissive of tech companies developing highly potent algorithms, even in settings that would currently be considered “private”, similar to the way we treat pharmaceutical companies developing highly potent biological specimens.
3. Points #1 and #2 mirror the way in which ML systems themselves are black boxes even to their creators, which fortunately is making some ML researchers uncomfortable enough to start holding conferences on accountability in ML.
4. More researchers getting involved in the task of defining and monitoring accountability can help tech company employees and regulators to reflect on the principle of accountability and whether tech companies themselves should be more subject to it at various scales (e.g., their software should be more accountable to its users and developers, their developers and users should be more accountable to the public, their executives should be more accountable to governments and civic society, etc.).
5. In futures where transformative AI technology is used to provide widespread services to many agents simultaneously (e.g., “Comprehensive AI services” scenarios), progress on defining and monitoring accountability can help “infuse” those services with a greater degree of accountability and hence safety to the rest of the world.

What follows is my narrative for how and why I believe the five points above.

At present, society is structured such that it is possible for a technology company to amass a huge amount of data and computing resources, and as long as their activities are kept “private”, they are free to use those resources to experiment with developing potentially misaligned and highly potent AI technologies. For instance, if a tech company tomorrow develops any of the following potentially highly potent technologies within a privately owned ML lab, there are no publicly mandated regulations regarding how they should handle or experiment with them:

misaligned superintelligences
fake news generators
powerful human behavior prediction and control tools
… any algorithm whatsoever

Moreover, there are virtually no publicly mandated regulations against knowingly or intentionally developing any of these artifacts within the confines of a privately owned lab, despite the fact that the mere existence of such an artifact poses a threat to society. This is the sense in which tech companies are “black boxes” to society, and potentially harmful as such.

(That’s point #1.)

Contrast this situation with the strict guidelines that pharmaceutical companies are required to adhere to in their management of pathogens. First, it is simply illegal for most companies to knowingly develop synthetic viruses, unless they are certified to do so by demonstrating a certain capacity for safe handling of the resulting artifacts. Second, conditional on having been authorized to develop viruses, companies are required to follow standardized safety protocols. Third, companies are subject to third-party audits to ensure compliance with these safety protocols, and are not simply trusted to follow them without question.

Nothing like this is true in the tech industry, because historically, algorithms have been viewed as less potent societal-scale risks than viruses. Indeed, present-day accountability norms in tech would allow an arbitrary level of disparity to develop between

the potency (in terms of potential impact) of algorithms developed in privately owned laboratories, and
the preparedness of the rest of society to handle those impacts if the algorithms were released (such as by accident, harmful intent, or poor judgement).

This is a mistake, and an increasingly untenable position as the power of AI and ML technology increases. In particular, a number of technology companies are intentionally trying to build artificial general intelligence, an artifact which, if released, would be much more potent than most viruses. These companies do in fact have safety researchers working internally to think about how to be safe and whether to release things. But contrast this again with pharmaceuticals. It just won’t fly for a pharmaceutical company to say “Don’t worry, we don’t plan to release it; we’ll just make up our own rules for how to be privately safe with it.” Eventually, we should probably stop accepting this position from tech companies as well.

(That’s point #2.)

Fortunately, even some researchers and developers are starting to become uncomfortable with “black boxes” playing important and consequential roles in society, as evidenced by the recent increase in attention on both accountability and interpretability in service of it, for instance:

(2016) Accountability in algorithmic decision making, Diakopoulos, Nicholas.
(2017) Accountability of AI under the law: The role of explanation, Doshi-Velez, Finale; Kortz, Mason; Budish, Ryan; Bavitz, Chris; Gershman, Sam; O'Brien, David; Schieber, Stuart; Waldo, James; Weinberger, David; Wood, Alexandra.
(2018) It's time to do something: Mitigating the negative impacts of computing through a change to the peer review process, Hecht, Brent; Wilcox, Lauren; Bigham, Jeffrey P; Schoning, Johannes; Hoque, Ehsan; Ernst, Jason; Bisk, Yonatan; Yarosh, Lana; Anjum, Bushra; Wu, Cathy.

This kind of discomfort both fuels and is fueled by decreasing levels of blind faith in the benefits of technology in general. Signs of this broader trend include:

(2020) NeurIPS 2020 FAQ --- includes references to Hecht et al (2018), Hecht (2020), and GovAI (2020) for writing about the potentially negative impacts of AI technology.
(2020) NSF America’s Seed Fund Technology Topic: AI, Atherton, Peter.
(2020) The Social Dilemma, a Netflix documentary by Jeff Orlowski, Davis Coombe, and Vickie Curtis.

Together, these trends indicate a decreasing level of blind faith in the addition of novel technologies to society, both in the form of black-box tech products, and black-box tech companies.

(That’s point #3.)

The European General Data Protection Regulation (GDPR) is a very good step for regulating how tech companies relate with the public. I say this knowing that GDPR is far from perfect. The reason it’s still extremely valuable is that it has initialized the variable defining humanity’s collective bargaining position (at least within Europe, and replicated to some extent by the CCPA) for controlling how tech companies use data. That variable can now be amended and hence improved upon without first having to ask the question “Are we even going to try to regulate how tech companies use data?” For a while, it wasn’t clear any action would ever be taken on this front, outside of specific domains like healthcare and finance.

However, while GDPR has defined a slope for regulating the use of data, we also need accountability for private uses of computing. As AlphaZero demonstrates, data-free computing alone is sufficient to develop super-human strategic competence in a well-specified domain.

When will it be time to disallow arbitrary private uses of computing resources, irrespective of their data sources? Is it time already? My opinions on this are outside the scope of what I intend to argue for in this post. But whenever the time comes to develop and enforce such accountability, it will probably be easier to do that if researchers and developers have spent more time thinking about what accountability is, what purposes are served by various versions of accountability, and how to achieve those kinds of accountability in both fully-automated and semi-automated systems. In other words, optimistically, more technical research on accountability in ML might result in more ML researchers transferring their awareness that «black box tech products are insufficiently accountable» to become more aware/convinced that «black box tech companies are insufficiently accountable».

(That’s point #4.)

But even if that transfer of awareness doesn't happen, automated approaches to accountability will still have a role to play if we end up in a future with large numbers of agents making use of AI-mediated services, such as in the “Comprehensive AI Services” model of the future. Specifically,

individual actors in a CAIS economy should be accountable to the principle of not privately developing highly potent technologies without adhering to publicly legitimized and auditable safety procedures, and
systems for reflecting on and updating accountability structures can be used to detect and remediate problematic behaviors in multi-agent systems, including behaviors that could yield existential risks from distributed systems (e.g., extreme resource consumption or pollution effects).

(That’s point #5.)

AccML educational value:

Unfortunately, technical work in this area is highly undeveloped, which is why I have assigned this area a relatively low educational value. I hope this does not trigger people to avoid contributing to it.

AccML neglect:

Correspondingly, this area is highly neglected relative to where I’d like it to be, on top of being very small in terms of the amount of technical work at its core.

AccML exemplars:

Recent examples of writing in AccML that I think are of particular value to existential safety include:

(2017) Accountability of AI under the law: The role of explanation, Doshi-Velez, Finale; Kortz, Mason; Budish, Ryan; Bavitz, Chris; Gershman, Sam; O'Brien, David; Schieber, Stuart; Waldo, James; Weinberger, David; Wood, Alexandra.
(2017) Value Alignment or Misalignment--What Will Keep Systems Accountable?, Arnold, Thomas; Kasenberg, Daniel; Scheutz, Matthias.
(2018) Towards formal definitions of blameworthiness, intention, and moral responsibility, Halpern, Joseph Y; Kleiman-Weiner, Max.*
* Note: I don't currently agree with the definitions of blameworthiness, intention, and responsibility in this paper, but I am glad to see people working toward agreeable definitions of these concepts, and I like that the title begins with "toward".
(2018) Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda, Abdul, Ashraf; Vermeulen, Jo; Wang, Danding; Lim, Brian Y; Kankanhalli, Mohan.
(2019) Policy certificates: Towards accountable reinforcement learning, Dann, Christoph; Li, Lihong; Wei, Wei; Brunskill, Emma.

Conclusion

Thanks for reading! I hope this post has been helpful to your thinking about the value of a variety of research areas for existential safety, or at the very least, your model of my thinking. As a reminder, these opinions are my own, and are not intended to represent any institution of which I am a part.

Reflections on scope & omissions

This post has been about:

Research, not individuals. Some readers might be interested in the question “What about so-and-so’s work at such-and-such institution?” I think that’s a fair question, but I prefer this post to be about ideas, not individual people. The reason is that I want to say both positive and negative things about each area, whereas I’m not prepared to write up public statements of positive and negative judgements about people (e.g., “Such-and-such is not going to succeed in their approach”, or “So-and-so seems fundamentally misguided about X”.)
Areas, not directions. This post is an appraisal of active areas of research—topics with groups of people already working on them writing up their findings. It’s primarily not an appraisal of potential directions—ways I think areas of research could change or be significantly improved (although I do sometimes comment on directions I’d like to see each area taking). For instance, I think intent alignment is an interesting topic, but the current paucity of publicly available technical writing on it makes it difficult to critique. As such, I think of intent alignment as a “direction” that AI alignment research could be taken in, rather than an “area”.
Papers, not legislation, books, or TV shows. Many intellectual artifacts aside from research papers matter to existential safety, including legislation, as well as fiction and non-fiction books, TV shows, and movies. Such works are not beyond the scope of my opinions, but are beyond the scope of this post.

Progress in OODR will mostly be used to help roll out more AI technologies into active deployment more quickly

It sounds like you may be assuming that people will roll out a technology when its reliability meets a certain level X, so that raising reliability of AI systems has no or little effect on the reliability of deployed systems (namely it will just be X). I may be misunderstanding.

A more plausible model is that deployment decisions will be based on many axes of quality, e.g. suppose you deploy when the sum of reliability and speed reaches some threshold Y. If that's the case, then raising reliability will improve the reliability and decrease the speed of deployed systems. If you think that increasing the reliability of AI systems is good (e.g. because AI developers want their AI systems to have various socially desirable properties and are limited by their ability to robustly achieve those properties) then this would be good.
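
As a toy illustration of the two deployment models contrasted here, the sketch below compares a reliability-only threshold X with a threshold Y on the sum of reliability and speed. Everything in it is an assumption made for illustration: both qualities start at zero and grow linearly per day, and the numbers, function names, and thresholds are hypothetical rather than taken from the discussion.

```python
def first_deployment(reliability_rate, speed_rate, rule, horizon=1000):
    """Return (day, reliability, speed) for the first day `rule` approves.

    Assumes both qualities start at 0 and grow linearly per day.
    Returns None if the rule never fires within `horizon` days.
    """
    for day in range(horizon):
        reliability = reliability_rate * day
        speed = speed_rate * day
        if rule(reliability, speed):
            return day, reliability, speed
    return None


X = 6   # hypothetical bar on reliability alone
Y = 12  # hypothetical bar on reliability + speed


def reliability_only(reliability, speed):
    """Deploy as soon as reliability alone reaches X."""
    return reliability >= X


def combined_quality(reliability, speed):
    """Deploy as soon as reliability plus speed reaches Y."""
    return reliability + speed >= Y


# Compare slow (1/day) and fast (2/day) reliability progress.
for rate in (1, 2):
    print("reliability rate", rate,
          "| threshold X:", first_deployment(rate, 1, reliability_only),
          "| threshold Y:", first_deployment(rate, 1, combined_quality))
```

With these numbers, doubling the rate of reliability progress under the X rule only moves deployment from day 6 to day 3, with deployed reliability fixed at 6. Under the Y rule it moves deployment from day 6 to day 4 while deployed reliability rises from 6 to 8 and deployed speed falls from 6 to 4. Which rule better describes real deployment decisions is exactly the point in dispute.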

I'm not clear on what part of that picture you disagree with or if you think that this is just small relative to some other risks. My sense is that most of the locally-contrarian views in this post are driven by locally-contrarian quantitative estimates of various risks. If that's the case, then it seems like the main thing that would shift my view would be some argument about the relative magnitude of risks. I'm not sure if other readers feel similarly.

Research in this area usually does not involve deep or lengthy reflections about the structure of society and human values and interactions, which I think makes this field sort of collectively blind to the consequences of the technologies it will help build.

This is a plausible view, but I'm not sure what negative consequences you have in mind (or how it affects the value of progress in the field rather than the educational value of hanging out with people in the field).

Incidentally, the main reason I think OODR research is educationally valuable is that it can eventually help with applying agent foundations research to societal-scale safety. Specifically: how can we know if one of the operations (a)-(f) above is safe to perform 1,000,000 times, given that it was safe the first 1,000 times we applied it in a controlled setting, but the setting is changing over time? This is a special case of an OODR question.

That task---how do we test that this system will consistently have property P, given that we can only test property P at training time?---is basically the goal of OODR research. Your prioritization of OODR suggests that maybe you think that's the "easy part" of the problem (perhaps because testing property P is so much harder), or that OODR doesn't make meaningful progress on that problem (perhaps because the nature of the problem is so different for different properties P?). Whatever it is, it seems like that's at the core of the disagreement and you don't say much about it. I think many people have the opposite intuition, i.e. that much of the expected harm from AI systems comes from behaviors that would have been recognized as problematic at training time.

In any case, I see AI alignment in turn as having two main potential applications to existential safety:

1. AI alignment is useful as a metaphor for thinking about how to align the global effects of AI technology with human existence, a major concern for AI governance at a global scale, and
2. AI alignment solutions could be used directly to govern powerful AI technologies designed specifically to make the world safer.

Here is one standard argument for working on alignment. It currently seems plausible that AI systems will be trying to do stuff that no one wants and that this could be very bad if AI systems are much more competent than humans. Prima facie, if the designers of AI systems are able to better control what AI systems are trying to do, then those AI systems are more likely to be trying to do what the developers want. So if we are able to give developers that ability, we can reduce the risk of AI competently doing stuff no one wants.

This isn't really a metaphor, it's a direct path for impact. It's unclear if you think that this argument is mistaken because developers will be able to control what their AI systems are trying to do, because they won't be motivated to deploy AI until they have that control, because it's not much better for AI systems to be trying to do what their developers want, because there are other more important reasons that AI systems could be trying to do stuff that no one wants, because there are other risks unrelated to AI trying to do stuff no one wants, or something else altogether.

(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.

Like you, I'm opposed to plans where people try to take over the world in order to make it safer. But this looks like a bit of a leap. For example, AI alignment may help us build powerful AI systems that help us negotiate or draft agreements, which doesn't seem like taking over the world to make it safer.

It sounds like you may be assuming that people will roll out a technology when its reliability meets a certain level X, so that raising reliability of AI systems has no or little effect on the reliability of deployed systems (namely it will just be X).

Yes, this is more or less my assumption. I think slower progress on OODR will delay release dates of transformative tech much more than it will improve quality/safety on the eventual date of release.

A more plausible model is that deployment decisions will be based on many axes of quality, e.g. suppose you deploy when the sum of reliability and speed reaches some threshold Y. If that's the case, then raising reliability will improve the reliability and decrease the speed of deployed systems. If you think that increasing the reliability of AI systems is good (e.g. because AI developers want their AI systems to have various socially desirable properties and are limited by their ability to robustly achieve those properties) then this would be good.
I'm not clear on what part of that picture you disagree with or if you think that this is just small relative to some other risks.

Thanks for asking; I do disagree with this! I think reliability is a strongly dominant factor in decisions about deploying real-world technology, such that to me it feels roughly correct to treat it as the only factor. In this way of thinking, which you rightly attribute to me, progress in OODR doesn't improve reliability on deployment-day, it mostly just moves deployment-day a bit earlier in time.

That's not to say I'm advocating being afraid of OODR research because it "shortens timelines", only that I think contributions to OODR are not particularly directly valuable to humanity's long-term fate. As the post emphasizes, if someone cares about existential safety and wants to deploy their professional ambition to reducing x-risk, I think OODR is of high educational value for them to learn about, and as such I would be against "censoring" it as a topic to be discussed here.

One frustration I have with the piece is that I read it as broadly in favour of the empirical distribution of governance demands. The section in the introduction talks of the benefits of legitimizing and fulfilling governance demands, and merely focussing on those demands that are helpful for existential safety. Similarly, I read the section on accountability in ML as broadly having a rhetorical stance that accountability is by default good, altho the recommendation to "help tech company employees and regulators to reflect on the principle of accountability and whether tech companies themselves should be more subject to it at various scales" would, if implemented literally, only promote the forms of accountability that are in fact good.

I'm frustrated by this stance that I infer the text to be taking, because I think that many existing and likely demands for accountability will be unjust and minimally conducive to existential safety. One example of unjust and ineffective accountability is regulatory capture of industries, where regulations tend to be overly lenient for incumbent players that have 'captured' the regulator and overly strict for players that might enter and compete with incumbents. Another is regulations of some legitimate activity by people uninformed about the activity and uninterested in allowing legitimate instances of the activity. My understanding is that most people agree that either regulation of abortions in many conservative US states or regulation of gun ownership in many liberal US states falls into this category. Note my claim is not that there are no legitimate governance demands in these examples, but that actual governance in these cases is unjust and ineffective at promoting legitimate ends, because it is not structured in a way that tends to produce good outcomes.

I am similarly frustrated by this claim:

The European General Data Protection Regulation (GDPR) is a very good step for regulating how tech companies relate with the public. I say this knowing that GDPR is far from perfect. The reason it's still extremely valuable is that it has initialized the variable defining humanity's collective bargaining position (at least within Europe...) for controlling how tech companies use data.

I read this as conflating European humanity with the European Union. I think the correct perspective to take is this: corporate boards keep corporations aligned with some aspects of some segment of humanity, and EU regulation keeps corporations aligned with different aspects of a different segment of humanity. Instead of thinking of this as a qualitative change from 'uncontrolled by humanity' to 'controlled by European humanity', I would rather have this be modelled as a change in the controlling structure, and have attention brought to bear on whether the change is in fact good.

Now, for the purpose of enhancing existential safety, I think it likely that any way of growing the set of people who can demand that AI corporations act in a way that serves those people's interests is better than governance purely by a board or employees of the company, because preserving existential safety is a broadly-held value, and outsiders may not be subject to as much bias as insiders about how dangerous the firm's technology is. Indeed, an increase in the size of this set by several orders of magnitude likely causes a qualitative shift. Nevertheless, I don't think there is much reason to think that the details of EU regulation are likely to be closely aligned with the interests of Europeans, and if the GDPR is valuable as a precedent to ensure that the EU can regulate data use, the alignment of the details of this data use is of great importance. As such, I think the structure of this governance is more important to focus on than the number of people taking part in governance.

In summary:

I hope that technical AI x-risk/existential safety researchers focus on legitimizing and fulfilling those governance and accountability demands that are in fact legitimate.
I hope that discussion of AI governance and accountability does not inhabit a frame in which demands for governance and accountability are reliably legitimate.

This comment is heavily informed by the perspectives that I understand to be advanced in the books The Myth of the Rational Voter, that democracies often choose poor policies because it isn't worth voters' time and effort to learn relevant facts and debias themselves, and The Problem of Political Authority, that democratic governance is often unjust, altho note that I have read neither book.

I also apologize for the political nature of this and the above comment. However, I don't know how to make it less political while still addressing the relevant parts of the post. I also think that the post is really great and thank Critch for writing it, despite the negative nature of the above comment.

My actual thought process for believing GDPR is good is not that it "is a sample from the empirical distribution of governance demands", but that it initializes the process of governments (and thereby the public they represent) weighing in on what tech companies can and cannot design their systems to reason about, and more specifically the degree to which systems are allowed to reason about humans. Having a regulatory structure in place for restricting access to human data is a good first step, but we'll probably also eventually want restrictions for how the systems process the data once they have it (e.g., they probably shouldn't be allowed to use what data they have to come up with ways to significantly deceive or manipulate users).

I'll say the same thing about fairness, in that I value having initialized the process of thinking about it not because it is in the "empirical distribution of governance demands", but because it's a useful governance demand. When things are more fair, people fight less, which is better/safer. I don't mind much that existing fairness research hasn't converged on what I consider "optimal fairness", because I think that consideration is dwarfed by the fact that technical AI researchers are thinking about fairness at all.

That said, while I disagree with your analysis, I do agree with your final position:

I hope that technical AI x-risk/existential safety researchers focus on legitimizing and fulfilling those governance and accountability demands that are in fact legitimate.
I hope that discussion of AI governance and accountability does not inhabit a frame in which demands for governance and accountability are reliably legitimate.

If single/single alignment is solved it feels like there are some salient "default" ways in which we'll end up approaching multi/multi alignment:

Existing single/single alignment techniques can also be applied to empower an organization rather than an individual. So we can use existing social technology to form firms and governments and so on, and those organizations will use AI.
AI systems can themselves participate in traditional social institutions. So AI systems that represent individual human interests can interact with each other e.g. in markets or democracies.

I totally agree that there are many important problems in the world even if we can align AI. That said, I remain interested in more clarity on what you see as the biggest risks with these multi/multi approaches that could be addressed with technical research.

For example, let's take the considerations you discuss under CSC:

Third, unless humanity collectively works very hard to maintain a degree of simplicity and legibility in the overall structure of society*, this “alignment revolution” will greatly complexify our environment to a point of much greater incomprehensibility and illegibility than even today’s world. This, in turn, will impoverish humanity’s collective ability to keep abreast of important international developments, as well as our ability to hold the international economy accountable for maintaining our happiness and existence.

One approach to this problem is to work to make it more likely that AI systems can adequately represent human interests in understanding and intervening on the structure of society. But this seems to be a single/single alignment problem (to whatever extent that existing humans currently try to maintain and influence our social structure, such that impairing their ability to do so is problematic at all) which you aren't excited about.

Fourth, in such a world, algorithms will be needed to hold the aggregate global behavior of algorithms accountable to human wellbeing, because things will be happening too quickly for humans to monitor. In short, an “algorithmic government” will be needed to govern “algorithmic society”. Some might argue this is not strictly necessary: in the absence of a mathematically codified algorithmic social contract, humans could in principle coordinate to cease or slow down the use of these powerful new alignment technologies, in order to give ourselves more time to adjust to and govern their use. However, for all our successes in innovating laws and governments, I do not believe current human legal norms are quite developed enough to stably manage a global economy empowered with individually-alignable transformative AI capabilities.

Again, it's not clear what you expect to happen when existing institutions are empowered by AI and mostly coordinate the activities of AI.

The last line reads to me like "If we were smarter, then our legal system may no longer be up to the challenge," with which I agree. But it seems like the main remedy is "if we were smarter, we would hopefully work on improving our legal system in tandem with the increasing demands we impose on it."

It feels like the salient actions to take to me are (i) make direct improvements in the relevant institutions, in a way that anticipates the changes brought about by AI but will most likely not look like AI research, (ii) work on improving the relative capability of AI at those tasks that seem more useful for guiding society in a positive direction.

I consider (ii) to be one of the most important kinds of research other than alignment for improving the impact of AI, and I consider (i) to be all-around one of the most important things to do for making the world better. Neither of them feels much like CSC (e.g. I don't think computer scientists are the best people to do them) and it's surprising to me that we end up at such different places (if only in framing and tone) from what seem like similar starting points.

> Third, unless humanity collectively works very hard to maintain a degree of simplicity and legibility in the overall structure of society*, this “alignment revolution” will greatly complexify our environment to a point of much greater incomprehensibility and illegibility than even today’s world. This, in turn, will impoverish humanity’s collective ability to keep abreast of important international developments, as well as our ability to hold the international economy accountable for maintaining our happiness and existence.
One approach to this problem is to work to make it more likely that AI systems can adequately represent human interests in understanding and intervening on the structure of society. But this seems to be a single/single alignment problem (to whatever extent that existing humans currently try to maintain and influence our social structure, such that impairing their ability to do so is problematic at all) which you aren't excited about.

Yes, you've correctly anticipated my view on this. Thanks for the very thoughtful reading!

To elaborate: I claim "turning up the volume" on everyone's individual agency (by augmenting them with user-aligned systems) does not automatically make society overall healthier and better able to survive, and in fact it might just hasten progress toward an unhealthy or destructive outcome. To me, the way to avoid this is not to make the aligned systems even more aligned with their users, but to start "aligning" them with the rest of society. "Aligning" with society doesn't just mean "serving" society, it means "fitting into it", which means the AI system needs to have a particular structure (not just a particular optimization objective) that makes it able to exist and function safely inside a larger society. The desired structure involves features like being transparent, legibly beneficial, and legibly fair. Without those aspects, I think your AI system introduces a bunch of political instability and competitive pressure into the world (e.g., fighting over disagreements about what it's doing or whether it's fair or whether it will be good), which I think by default turns up the knob on x-risk rather than turning it down. For a few stories somewhat-resembling this claim, see my next post:

https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic

Of course, if you make a super-aligned self-modifying AI, it might immediately self-modify so that its structure is more legibly beneficial and fair, because of the necessity (if I'm correct) of having that structure for benefitting society and therefore its creators/users. However, my preferred approach to building societally-compatible AI is not to make societally-incompatible AI systems and hope that they know their users "want" them to transform into more societally-compatible systems. I think we should build highly societally-compatible systems to begin with, not just because it seems broadly "healthier", but because I think it's necessary for getting existential risk down to tolerable levels like <3% or <1%. Moreover, because this view seems misunderstood by x-safety enthusiasts, I currently put the plurality of my existential-failure probability on outcomes arising from problems other than individual systems being misaligned (in terms of the objective) with the users or creators. Dafoe et al would call this "structural risk", which I find to be a helpful framing that should be applied not only to the structure of society external to the AI system, but also the system's internal structure.

That said, I remain interested in more clarity on what you see as the biggest risks with these multi/multi approaches that could be addressed with technical research.

A (though not necessarily the most important) reason to think technical research into computational social choice might be useful is that examining specifically the behaviour of RL agents from a computational social choice perspective might alert us to ways in which coordination with future TAI might be similar or different to the existing coordination problems we face.

(i) make direct improvements in the relevant institutions, in a way that anticipates the changes brought about by AI but will most likely not look like AI research,

It seems premature to say, in advance of actually seeing what such research uncovers, whether the relevant mechanisms and governance improvements are exactly the same as the improvements we need for good governance generally, or different. Suppose examining the behaviour of current RL agents in social dilemmas leads to a general result which in turn leads us to conclude there's a disproportionate chance TAI in the future will coordinate in some damaging way that we can resolve with a particular new regulation. It's always possible to say, solving the single/single alignment problem will prevent anything like that from happening in the first place, but why put all your hopes on plan A, when plan B is relatively neglected?
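As a toy illustration of the kind of social dilemma in question, here is a minimal sketch of a one-shot prisoner's dilemma. The payoff numbers and the helper names (`PAYOFF`, `best_response`, `is_nash`) are assumptions chosen for illustration, and no learning is involved; the point is only that the unique equilibrium can leave both players worse off than mutual cooperation, which is the sort of structure one would then study RL agents in.

```python
# Row player's payoffs in a standard prisoner's dilemma (illustrative numbers).
# Keys are (my_action, their_action); "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}
ACTIONS = ("C", "D")


def best_response(their_action):
    """Action that maximizes my payoff given the other player's action."""
    return max(ACTIONS, key=lambda mine: PAYOFF[(mine, their_action)])


def is_nash(a, b):
    """True if neither player gains by deviating unilaterally.

    The game is symmetric, so the same payoff table serves both players.
    """
    return best_response(b) == a and best_response(a) == b


equilibria = [(a, b) for a in ACTIONS for b in ACTIONS if is_nash(a, b)]
print("Nash equilibria:", equilibria)
print("Payoff each at (C, C):", PAYOFF[("C", "C")])
print("Payoff each at (D, D):", PAYOFF[("D", "D")])
```

Whether trained agents actually settle into the bad equilibrium in richer environments is exactly the empirical question; this sketch only fixes the vocabulary.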

It's always possible to say, solving the single/single alignment problem will prevent anything like that from happening in the first place, but why put all your hopes on plan A, when plan B is relatively neglected?

The OP writes "contributions to AI alignment are also generally unhelpful to existential safety." I don't think I'm taking a strong stand in favor of putting all our hopes on plan A, I'm trying to understand the perspective on which plan B is much more important even before considering neglectedness.

It seems premature to say, in advance of actually seeing what such research uncovers, whether the relevant mechanisms and governance improvements are exactly the same as the improvements we need for good governance generally, or different.

I agree that would be premature. That said, I still found it notable that OP saw such a large gap between the importance of CSC and other areas on and off the list (including MARL). Given that I would have these things in a different order (before having thought deeply), it seemed to illustrate a striking difference in perspective. I'm not really trying to take a strong stand, just using it to illustrate and explore that difference in perspective.

Among other things, this post promotes the thesis that (single/single) AI alignment is insufficient for AI existential safety and the current focus of the AI risk community on AI alignment is excessive. I'll try to recap the idea the way I think of it.

We can roughly identify 3 dimensions of AI progress: AI capability, atomic AI alignment and social AI alignment. Here, atomic AI alignment is the ability to align a single AI system with a single user, whereas social AI alignment is the ability to align the sum total of AI systems with society as a whole. Depending on the relative rates at which those 3 dimensions develop, there are roughly 3 possible outcomes (ofc in reality it's probably more of a spectrum):

Outcome A: The classic "paperclip" scenario. Progress in atomic AI alignment doesn't keep up with progress in AI capability. Transformative AI is unaligned with any user, and as a result the future contains virtually nothing of value to us.

Outcome B: Progress in atomic AI alignment keeps up with progress in AI capability, but progress in social AI alignment doesn't keep up. Transformative AI is aligned with a small fraction of the population, resulting in this minority gaining absolute power and abusing it to create an extremely inegalitarian future. Wars between different factions are also a concern.

Outcome C: Both atomic and social alignment keep up with AI capability. Transformative AI is aligned with society/humanity as a whole, resulting in a benevolent future for everyone.

Ideally, Outcome C is the outcome we want (with the exception of people who decided to gamble on being part of the elite in outcome B). Arguably, C > B > A (although it's possible to imagine scenarios in which B < A). How does it translate into research priorities? This depends on several parameters:

The "default" pace of progress in each dimension: e.g. if we assume atomic AI alignment will be solved in time anyway, then we should focus on social AI alignment.
The inherent difficulty of each dimension: e.g. if we assume atomic AI alignment is relatively hard (and will therefore take a long time to solve) whereas social AI alignment becomes relatively easy once atomic AI alignment is solved, then we should focus on atomic AI alignment.
The extent to which each dimension depends on others: e.g. if we assume it's impossible to make progress in social AI alignment without reaching some milestone in atomic AI alignment, then we should focus on atomic AI alignment for now. Similarly, some argued we shouldn't work on alignment at all before making more progress in capability.
More precisely, the last two can be modeled jointly as the cost of marginal progress in a given dimension as a function of total progress in all dimensions.
The extent to which outcome B is bad for people not in the elite: If it's not too bad then it's more important to prevent outcome A by focusing on atomic AI alignment, and vice versa.

The OP's conclusion seems to be that social AI alignment should be the main focus. Personally, I'm less convinced. It would be interesting to see more detailed arguments about the above parameters that support or refute this thesis.

Outcome B: Progress in atomic AI alignment keeps up with progress in AI capability, but progress in social AI alignment doesn't keep up. Transformative AI is aligned with a small fraction of the population, resulting in this minority gaining absolute power and abusing it to create an extremely inegalitarian future. Wars between different factions are also a concern.

It's unclear to me how this particular outcome relates to social alignment (or at least to the kinds of research areas in this post). Some possibilities:

Does failure to solve social alignment mean that firms and governments cannot use AI to represent their shareholders and constituents? Why might that be? (E.g. what's a plausible approach to atomic alignment that couldn't be used by a firm or government?)
Does AI progress occur unevenly such that some group gets much more power/profit, and then uses that power? If so, how would technical progress on alignment help address that outcome? (Why would the group with power be inclined to use whatever techniques we're imagining?) Also, why does this happen?
Does AI progress somehow complicate the problem of governance or corporate governance such that those organizations can no longer represent their constituents/shareholders? What is the mechanism (or any mechanism) by which this happens? Does social alignment help by making new forms of organization possible, and if so should I just be thinking of it as a way of improving those institutions, or is it somehow distinctive?
Do we already believe that the situation is gravely unequal (e.g. because governments can't effectively represent their constituents and most people don't have a meaningful amount of capital) and AI progress will exacerbate that situation? How does social alignment prevent that?

(This might make more sense as a question for the OP, it just seemed easier to engage with this comment since it describes a particular more concrete possibility. My sense is that the OP may be more concerned about failures in which no one gets what they want rather than outcome B per se.)

Outcome C is most naturally achieved using "direct democracy" TAI, i.e. one that collects inputs from everyone and aggregates them in a reasonable way. We can try emulating democratic AI via single user AI, but that's hard because:

If the number of AIs is small, the AI interface becomes a single point of failure: an actor that can hijack the interface will have enormous power.
If the number of AIs is small, it might be unclear what inputs should be fed into the AI in order to fairly represent the collective. It requires "manually" solving the preference aggregation problem, and faults of the solution might be amplified by the powerful optimization to which it is subjected.
If the number of AIs is more than one then we should make sure the AIs are good at cooperating, which requires research about multi-AI scenarios.
If the number of AIs is large (e.g. one per person), we need the interface to be sufficiently robust that people can use it correctly without special training. Also, this might be prohibitively expensive.

Designing democratic AI requires good theoretical solutions for preference aggregation and the associated mechanism design problem, and good practical solutions for making it easy to use and hard to hack. Moreover, we need to get the politicians to implement those solutions. Regarding the latter, the OP argues that certain types of research can help lay the foundation by providing actionable regulation proposals.
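To see why preference aggregation is not a solved detail, here is a minimal sketch of the classic Condorcet cycle, where pairwise majority voting over three options yields no consistent collective ranking. The ballots and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical ballots: each voter ranks three options, most preferred first.
ballots = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]


def majority_prefers(ballots, x, y):
    """Return True if a strict majority of ballots rank x above y."""
    votes_for_x = sum(1 for b in ballots if b.index(x) < b.index(y))
    return votes_for_x > len(ballots) / 2


def condorcet_winner(ballots):
    """Return the option that beats every other head-to-head, or None."""
    options = ballots[0]
    for candidate in options:
        if all(
            majority_prefers(ballots, candidate, other)
            for other in options
            if other != candidate
        ):
            return candidate
    return None


# A beats B, B beats C, and C beats A: the majority preference is cyclic.
print(majority_prefers(ballots, "A", "B"))
print(majority_prefers(ballots, "B", "C"))
print(majority_prefers(ballots, "C", "A"))
print(condorcet_winner(ballots))
```

Any aggregation rule has to break such cycles somehow, and a powerful optimizer applied to the result would amplify whatever choice the rule makes.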

My sense is that the OP may be more concerned about failures in which no one gets what they want rather than outcome B per se

Well, the OP did say:

(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.

I understood it as hinting at outcome B, but I might be wrong.

Outcome C is most naturally achieved using "direct democracy" TAI, i.e. one that collects inputs from everyone and aggregates them in a reasonable way. We can try emulating democratic AI via single user AI, but that's hard because:

I'm not sure what's most natural, but I do consider this a fairly unlikely way of achieving outcome C.

I think the best argument for this kind of outcome is from Wei Dai, but I don't think it gets you close to the "direct democracy" outcome. (Even if you had state control and AI systems aligned with the state, it seems unlikely and probably undesirable for the state to be replaced with an aggregation procedure implemented by the AI itself.)

A lot depends on AI capability as a function of cost and time. On one extreme, there might be enough rising returns to get a singleton: some combination of extreme investment and algorithmic advantage produces extremely powerful AI, moderate investment or no algorithmic advantage doesn't produce moderately powerful AI. Whoever controls the singleton has all the power. On the other extreme, returns don't rise much, resulting in personal AIs having as much or more collective power as corporate/government AIs. In the middle, there are many powerful AIs but still not nearly as many as people.

In the first scenario, to get outcome C we need the singleton to either be democratic by design, or have a very sophisticated and robust system of controlling access to it.

In the last scenario, the free market would lead to outcome B. Corporate and government actors use their access to capital to gain power through AI until the rest of the population becomes irrelevant. Effectively, AI serves as an extreme amplifier of pre-existing power differentials. Arguably, the only way to get outcome C is enforcing democratization of AI through regulation. If this seems extreme, compare it to the way our society handles physical violence. The state has a monopoly on violence, and with good reason: without this monopoly, upholding the law would be impossible. But, in the age of superhuman AI, traditional means of violence are irrelevant. The only important weapon is AI.

In the second scenario, we can manage without multi-user alignment. However, we still need to have multi-AI alignment, i.e. make sure the AIs are good at coordination problems. It's possible that any sufficiently capable AI is automatically good at coordination problems, but it's not guaranteed. (Incidentally, if atomic alignment is flawed then it might be actually better for the AIs to be bad at coordination.)

The OP's conclusion seems to be that social AI alignment should be the main focus. Personally, I'm less convinced. It would be interesting to see more detailed arguments about the above parameters that support or refute this thesis.

Thanks for the feedback, Vanessa. I've just written a follow-up post to better illustrate a class of societal-scale failure modes ("unsafe robust agent-agnostic processes") that constitutes the majority of the probability mass I currently place on human extinction precipitated by transformative AI advancements (especially AGI, and/or high-level machine intelligence in the language of Grace et al). Here it is:

https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic

I'd be curious to see if it convinces you that what you call "social alignment" should be our main focus, or at least a much greater focus than currently.

with the exception of people who decided to gamble on being part of the elite in outcome B

Game-theoretically, there's a better way. Assume that after winning the AI race, it is easy to figure out everyone else's win probability, utility function and what they would do if they won. Human utility functions have diminishing returns, so there's opportunity for acausal trade. Human ancestry gives a common notion of fairness, so the bargaining problem is easier than with aliens.

Most of us care some even about those who would take all for themselves, so instead of giving them the choice between none and a lot, we can give them the choice between some and a lot - the smaller their win prob, the smaller the gap can be while still incentivizing cooperation.

Therefore, the AI race game is not all or nothing. The more win probability lands on parties that can bargain properly, the less multiversal utility is burned.
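The role of diminishing returns can be made concrete with a toy calculation. This is a minimal sketch under strong assumptions that are not in the comment itself: two parties, total resources normalized to 1, a square-root utility function, and a deal in which each party receives a share equal to its win probability no matter who wins.

```python
import math


def utility(share: float) -> float:
    """Toy concave utility: square root of the share of resources received."""
    return math.sqrt(share)


def expected_utility_no_deal(win_prob: float) -> float:
    """Winner takes everything (share 1); loser gets nothing (share 0)."""
    return win_prob * utility(1.0) + (1.0 - win_prob) * utility(0.0)


def expected_utility_with_deal(win_prob: float) -> float:
    """Whoever wins, each party receives a share equal to its win probability."""
    return utility(win_prob)


for p in (0.1, 0.5, 0.9):
    print(
        f"win prob {p}: no deal {expected_utility_no_deal(p):.3f}, "
        f"with deal {expected_utility_with_deal(p):.3f}"
    )
```

Under these assumptions the deal is better in expectation for every party with a win probability strictly between 0 and 1, because a guaranteed share is worth more than a gamble with the same expected share when utility is concave. The sketch says nothing about whether the deal can actually be enforced, which is the hard part.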

Good point, acausal trade can at least ameliorate the problem, pushing towards atomic alignment. However, we understand acausal trade too poorly to be highly confident it will work. And, "making acausal trade work" might in itself be considered outside of the desiderata of atomic alignment (since it involves multiple AIs). Moreover, there are also actors that have a very low probability of becoming TAI users but whose support is beneficial for TAI projects (e.g. small donors). Since they have no counterfactual AI to bargain on their behalf, it is less likely acausal trade works here.

Yeah, I basically hope that enough people care about enough other people that some of the wealth ends up trickling down to everyone. Win probability is basically interchangeable with other people caring about you and your resources across the multiverse. Good thing the cosmos is so large.

I don't think making acausal trade work is that hard. All that is required is:

That the winner cares about the counterfactual versions of himself that didn't win, or equivalently, is unsure whether they're being simulated by another winner. (huh, one could actually impact this through memetic work today, though messing with people's preferences like that doesn't sound friendly)
That they think to simulate alternate winners before they expand too far to be simulated.

I'd like more discussion of the claim that alignment research is unhelpful-at-best for existential safety because of it accelerating deployment. It seems to me that alignment research has a couple paths to positive impact which might balance the risk:

Tech companies will be incentivized to deploy AI with slipshod alignment, which might then take actions that no one wants and which pose existential risk. (Concretely, I'm thinking of out with a whimper and out with a bang scenarios.) But the existence of better alignment techniques might legitimize governance demands, i.e. demands that tech companies don't make products that do things that literally no one wants.
Single/single alignment might be a prerequisite to certain computational social choice solutions. E.g., once we know how to build an agent that "does what [human] wants", we can then build an agent that "helps [human 1] and [human 2] draw up incomplete contracts for mutual benefit subject to the constraints in the [policy] written by [human 3]". And slipshod alignment might not be enough for this application.

I'd believe the claim if I thought that alignment was easy enough that AI products that pass internal product review and which don't immediately trigger lawsuits would be aligned enough to not end the world through alignment failure. But I don't think that's the case, unfortunately.

It seems like we'll have to put special effort into both single/single alignment and multi/single "alignment", because the free market might not give it to us.

(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.

Wait, you want to do it the hard way? Not only win the AI race with enough head start for safety, but stop right at the finish line and have everyone else stop at the finish line? However would you prevent everyone everywhere from going over? If you manage to find a way, that sounds like taking over the world with extra steps.

So I agree with Paul's comment that there's another motivation for work on preference learning besides the two you identify. But even if I take on what I believe to be your views of the risks, it seems like there is something very close to preference learning that is still helpful to existential safety. I have sometimes called it the specification problem: given a desired behavior, how do you provide a training signal to an AI system such that it is incentivized to behave that way? Typical approaches include imitation learning, learning from comparisons / preferences, learning from corrections, etc.
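To make "learning from comparisons" concrete, here is a minimal sketch that fits one scalar reward per behavior from pairwise preference data. It assumes a Bradley-Terry model, in which the probability that behavior i is preferred to behavior j is sigmoid(r_i - r_j), and fits it by plain gradient ascent. The data, the function name `fit_rewards`, and the hyperparameters are all hypothetical; this is not a reproduction of any particular paper's method.

```python
import math


def fit_rewards(num_items, comparisons, steps=2000, lr=0.1):
    """Fit one scalar reward per item from (winner, loser) index pairs.

    Model assumption (Bradley-Terry): P(i preferred to j) = sigmoid(r_i - r_j).
    Rewards are only identified up to an additive constant.
    """
    rewards = [0.0] * num_items
    for _ in range(steps):
        grads = [0.0] * num_items
        for winner, loser in comparisons:
            diff = rewards[winner] - rewards[loser]
            p_win = 1.0 / (1.0 + math.exp(-diff))
            # Gradient of log sigmoid(r_w - r_l) w.r.t. r_w is (1 - p_win),
            # and w.r.t. r_l it is -(1 - p_win).
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        rewards = [r + lr * g for r, g in zip(rewards, grads)]
    return rewards


# Hypothetical data: behavior 0 is preferred to 1 in 8 of 10 comparisons,
# and behavior 1 is preferred to 2 in 8 of 10 comparisons.
comparisons = [(0, 1)] * 8 + [(1, 0)] * 2 + [(1, 2)] * 8 + [(2, 1)] * 2
rewards = fit_rewards(3, comparisons)
print([round(r, 3) for r in rewards])
```

The fitted rewards could then serve as a training signal for a policy. The sketch leaves out everything that makes the real problem hard, such as generalizing to behaviors that were never compared.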

Before I explain why I think this should be useful even on your views, let me try clarifying the field more. Looking at the papers you list as exemplars of PL:

(2017) Deep reinforcement learning from human preferences, Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario.
(2018) Reward learning from human preferences and demonstrations in Atari, Ibarz, Borja; Leike, Jan; Pohlen, Tobias; Irving, Geoffrey; Legg, Shane; Amodei, Dario.
(2018) The alignment problem for Bayesian history-based reinforcement learners, Everitt, Tom; Hutter, Marcus.
(2019) Learning human objectives by evaluating hypothetical behavior, Reddy, Siddharth; Dragan, Anca D; Levine, Sergey; Legg, Shane; Leike, Jan.
(2019) On the feasibility of learning, rather than assuming, human biases for reward inference, Shah, Rohin; Gundotra, Noah; Abbeel, Pieter; Dragan, Anca D.
(2020) Reward-rational (implicit) choice: A unifying formalism for reward learning, Jeon, Hong Jun; Milli, Smitha; Dragan, Anca D.

I think the first, second, fourth and sixth are clear exemplars of work tackling the specification problem (whether or not the authors would put it that way themselves). The third is unclear (I wouldn't have put it in PL, nor with the specification problem, though I might be forgetting what's in it). The fifth is mostly PL and less about the specification problem; I am less excited about that paper as a result.

Okay, so why should this be useful even on (my model of) your views? You say that you want to anticipate, legitimize and fulfill governance demands. I see the combination of <specification problem field> and OODR as one of the best ways of fulfilling governance demands (which can then be used to legitimize them in advance, if you are able to anticipate them). In particular, most governance demands will look like "please make your AI systems satisfy property P", where P is some phrase in natural language that's fuzzy and can't immediately be grounded (for example, fairness). It seems to me that given such a demand, a natural way of solving it is to figure out which behaviors do and don't satisfy P, and then use your solutions to the specification problem to incentivize your AI system to satisfy P, and then use OODR to ensure that they actually satisfy P in all situations. I expect this to work in the next decade to e.g. ensure that natural language systems almost never deceive people into thinking they are human.

A number of blogs seem to treat [AI existential safety, AI alignment, and AI safety] as near-synonyms (e.g., LessWrong, the Alignment Forum), and I think that is a mistake, at least when it comes to guiding technical work for existential safety.

I strongly agree with the benefits of having separate terms and generally like your definitions.

In this post, AI existential safety means “preventing AI technology from posing risks to humanity that are comparable or greater than human extinction in terms of their moral significance.”

I like "existential AI safety" as a term to distinguish from "AI safety" and agree that it seems to be clearer and have more staying power. (That said, it's a bummer that "AI existential safety forum" is a bit of a mouthful.)

If I read that term without a definition I would assume it meant "reducing the existential risk posed by AI." Hopefully you'd be OK with that reading. I'm not sure if you are trying to subtly distinguish it from Nick's definition of existential risk or if the definition you give is just intended to be somewhere in that space of what people mean when they say "existential risk" (e.g. the LW definition is like yours).

Good to hear!

If I read that term ["AI existential safety"] without a definition I would assume it meant "reducing the existential risk posed by AI." Hopefully you'd be OK with that reading. I'm not sure if you are trying to subtly distinguish it from Nick's definition of existential risk or if the definition you give is just intended to be somewhere in that space of what people mean when they say "existential risk" (e.g. the LW definition is like yours).

Yep, that's my intention. If given the chance I'd also shift the meaning of "existential risk" a bit away from Bostrom's and a bit toward a more naive meaning of the term, but that's a separate objective :) Specifically, if I got to rewrite Nick's terminology (which might be too late now that it's on Wikipedia), I'd say "existential risk" should mean "risk to the existence of humanity" and "existential-level risk" should mean "risks that are as morally significant as risks to the existence of humanity" (which, roughly speaking, is what Bostrom currently calls "existential risk").

Curated, for several reasons.

I think it's really hard to figure out how to help with beneficial AI. Career and research paths vary in how likely they are to help, or harm, or fit together. I think many prominent thinkers in the AI landscape have developed nuanced takes on how to think about the evolving landscape, but often haven't written up those thoughts.

I like this post both for laying out a lot of object-level thoughts about that, and also for demonstrating a possible framework for organizing those object-level thoughts, and for doing it very comprehensively.

I haven't finished processing all of the object level points and am not sure which ones I endorse at this point. But I'm looking forward to debate on the various points here. I'd welcome other thinkers in the AI Existential Safety space writing up similarly comprehensive posts about how they think about all of this.

Thanks for this long and very detailed post!

The MARL projects with the greatest potential to help are probably those that find ways to achieve cooperation between decentrally trained agents in a competitive task environment, because of its potential to minimize destructive conflicts between fleets of AI systems that cause collateral damage to humanity. That said, even this area of research risks making it easier for fleets of machines to cooperate and/or collude at the exclusion of humans, increasing the risk of humans becoming gradually disenfranchised and perhaps replaced entirely by machines that are better and faster at cooperation than humans.

In ARCHES, you mention that just examining the multiagent behaviour of RL systems (or other systems that work as toy/small-scale examples of what future transformative AI might look like) might enable us to get ahead of potential multiagent risks, or at least try to predict how transformative AI might behave in multiagent settings. The way you describe it in ARCHES, the research would be purely exploratory:

One approach to this research area is to continually examine social dilemmas through the lens of whatever is the leading AI development paradigm in a given year or decade, and attempt to classify interesting behaviors as they emerge. This approach might be viewed as analogous to developing “transparency for multi-agent systems”: first develop interesting multi-agent systems, and then try to understand them.

But what you're suggesting in this post, 'those that find ways to achieve cooperation between decentrally trained agents in a competitive task environment', sounds like combining computational social choice research with multiagent RL - examining the behaviour of RL agents in social dilemmas and trying to design mechanisms that work to produce the kind of behaviour we want. To do that, you'd need insights from social choice theory. There is some existing research on this, but it's sparse and very exploratory.

OpenAI just released a paper on RL agents in social dilemmas, https://arxiv.org/pdf/2011.05373v1.pdf, and there is some previous work. This is more directly multiagent RL, but there is some consideration for things like choosing the right overall social welfare metric.
There are also two papers examining bandit algorithms in iterated voting scenarios, https://hal.archives-ouvertes.fr/hal-02641165/document and https://www.irit.fr/~Umberto.Grandi/scone/Layka.m2.pdf.

My current research is attempting to build on the second of these.

As far as I can tell, that's more or less it in terms of examining RL agents in social dilemmas, so there may well be a lot of low-hanging fruit and interesting discoveries to be made. If the research is specifically about finding ways of achieving cooperation in multiagent systems by choosing the correct (e.g. voting) mechanism, is that not also computational social choice research, and therefore of higher priority by your metric?

In short, computational social choice research will be necessary to legitimize and fulfill governance demands for technology companies (automated and human-run companies alike) to ensure AI technologies are beneficial to and controllable by human society.
...

CSC neglect:
As mentioned above, I think CSC is still far from ready to fulfill governance demands at the ever-increasing speed and scale that will be needed to ensure existential safety in the wake of “the alignment revolution”.

Great post!

I suppose you'll be more optimistic about Single/Single areas if you update towards fast/discontinuous takeoff?

This post is amazing. Both for me as a researcher, and for the people I know that want to contribute to AI existential safety. Just last week, a friend asked what he should try to do his PhD in AI/ML on, if he wants to contribute to AI existential safety. I mentioned interpretability, but now I have somewhere to redirect him.

As for my own thinking, I value immensely the attempt to say what is in the right direction even in technical research like AI Alignment. Most people in this area are here to help AI existential safety, but even after deciding to go into the field, the question of the relevance of specific research ideas should be asked. I'm more into agent foundations kind of stuff, but even there, as you argue, one can look for consequences of success on AI existential safety.

The main way I can see present-day technical research benefitting existential safety is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise over the next 10-30 years. In short, there often needs to be some amount of traction on a technical area before it’s politically viable for governing bodies to demand that institutions apply and improve upon solutions in those areas.

Great way to think about the value of some research! I would probably add "creating", because some governance demands come from technical study finding potential issues we need to deal with. Also, I really would love to see a specific post on this take, or a question; really anything that doesn't require precommitting to read a long post on a related subject.

Regarding the two ways you enumerate in which AI alignment could serve to further existential safety, I think a third, more viable, way is missing:

AI alignment solutions allow humans to build powerful AI systems that behave as planned without compromising existential safety.

I presume that it is desirable to build powerful AI systems - either to do object-level useful things, or to help humanity regulate other AI systems. There is a family of arguments that I associate with Bostrom and Yudkowsky that it is difficult to build such powerful AI systems so that they are aligned with what their creator wants them to do, either for 'outer alignment' reasons of difficulty in objective specification, or for 'inner alignment' reasons of inherent difficulties in optimization. This family of arguments also advances the idea that such alignment failures can have consequences that compromise existential safety. If you believe these arguments, then it appears to me that AI alignment solutions are necessary, but not sufficient, for existential safety.

Planned summary for the Alignment Newsletter:

This long post explains the author’s beliefs about a variety of research topics relevant to AI existential safety. First, let’s look at some definitions.
While AI safety alone just means getting AI systems to avoid risks (including e.g. the risk of a self-driving car crashing), _AI existential safety_ means preventing AI systems from posing risks at least as bad as human extinction. _AI alignment_ on the other hand is about getting an AI system to try to / succeed at doing what a person or institution wants them to do. (The “try” version is _intent alignment_, while the “succeed” version is _impact alignment_.)
Note that AI alignment is not the same thing as AI existential safety. In addition, the author makes the stronger claim that it is insufficient to guarantee AI existential safety, because AI alignment tends to focus on situations involving a single human and a single AI system, whereas AI existential safety requires navigating systems involving multiple humans and multiple AI systems. Just as AI alignment researchers worry that work on AI capabilities for useful systems doesn’t engage enough with the difficulty of alignment, the author worries that work on alignment doesn’t engage enough with the difficulty of multiagent systems.
The author also defines _AI ethics_ as the principles that AI developers and systems should follow, and _AI governance_ as identifying _and enforcing_ norms for AI developers and systems to follow. While ethics research may focus on resolving disagreements, governance will be more focused on finding agreeable principles and putting them into practice.
Let’s now turn to how to achieve AI existential safety. The main mechanism the author sees is to _anticipate, legitimize, and fulfill governance demands_ for AI technology. Roughly, governance demands are those properties which there are social and political pressures for, such as “AI systems should be fair” or “AI systems should not lead to human extinction”. If we can _anticipate_ these demands in advance, then we can do technical work on how to _fulfill_ or meet these demands, which in turn _legitimizes_ them, that is, it makes it clearer that the demand can be fulfilled and so makes it easier to create common knowledge that it is likely to become a legal or professional standard.
We then turn to various different fields of research, which the author ranks on three axes: helpfulness to AI existential safety (including potential negative effects), educational value, and neglectedness. Note that for educational value, the author is estimating the benefits of conducting research on the topic _to the researcher_, and not to (say) the rest of the field. I’ll only focus on helpfulness to AI existential safety below, since that’s what I’m most interested in (it’s where the most disagreement is, and so where new arguments are most useful), but I do think all three axes are important.
The author ranks both preference learning and out of distribution robustness lowest on helpfulness to existential safety (1/10), primarily because companies already have a strong incentive to have robust AI systems that understand preferences.
Multiagent reinforcement learning (MARL) comes only slightly higher (2/10): since it doesn’t involve humans, its main purpose seems to be to deploy fleets of agents that may pose risks to humanity. It is possible that MARL research could help by producing <@cooperative agents@>(@Cooperative AI Workshop@), but even this carries its own risks.
Agent foundations is especially dual-use in this framing, because it can help us understand the big multiagent system of interactions, and there isn’t a restriction on how that understanding could be used. It consequently gets a low score (3/10), which is a combination of “targeted applications could be very useful” and “it could lead to powerful harmful forces”.
Minimizing side effects starts to address the challenges the author sees as important (4/10): in particular, it can allow us both to prevent accidents, where an AI system “messes up”, and it can help us prevent externalities (harms to people other than the primary stakeholders), which are one of the most challenging issues in regulating multiagent systems.
Fairness is valuable for the obvious reason: it is a particular governance demand that we have anticipated, and research on it now will help fulfill and legitimize that demand. In addition, research on fairness helps get people to think at a societal scale, and to think about the context in which AI systems are deployed. It may also help prevent centralization of power from deployment of AI systems, since that would be an unfair outcome.
The author would love it if AI/ML pivoted to frequently think about real-life humans and their desires, values and vulnerabilities. Human-robot interaction (HRI) is a great way to cause more of that to happen, and that alone is valuable enough that the author assigns it 6/10, tying it with fairness.
As we deploy more and more powerful AI systems, things will eventually happen too quickly for humans to monitor. As a result, we will need to also automate the process of governance itself. The area of computational social choice is well-positioned to make this happen (7/10), though certainly current proposals are insufficient and more research is needed.
Accountability in ML is good (8/10) primarily because as we make ML systems accountable, we will likely also start to make tech companies accountable, which seems important for governance. In addition, in a <@CAIS@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@) scenario, better accountability mechanisms seem likely to help in ensuring that the various AI systems remain accountable, and thus safer, to human society.
Finally, interpretability is useful (8/10) for the obvious reasons: it allows developers to more accurately judge the properties of systems they build, and helps in holding developers and systems accountable. But the most important reason may be that interpretable systems can make it significantly easier for competing institutions and nations to establish cooperation around AI-heavy operations.

Planned opinion:

I liked this post: it’s a good exploration of what you might do if your goal was to work on technical approaches to future governance challenges; that seems valuable and I broadly agree with it (though I did have some nitpicks in [this comment](https://www.alignmentforum.org/posts/hvGoYXi2kgnS3vxqb/some-ai-research-areas-and-their-relevance-to-existential-1?commentId=LjvvW3xddPTXaequB)).
There is then an additional question of whether the best thing to do to improve AI existential safety is to work on technical approaches to governance challenges. There’s some pushback on this claim in the comments that I agree with; I recommend reading through it. It seems like the core disagreement is on the relative importance of risks: in particular, it sounds like the author thinks that existing incentives for preference learning and out-of-distribution robustness are strong enough that we mostly don’t have to worry about it, whereas governance will be much more challenging; I disagree with at least that relative ranking.
It’s possible that we agree on the strength of existing incentives -- I’ve <@claimed@>(@Conversation with Rohin Shah@) a risk of 10% for existentially bad failures of intent alignment if there is no longtermist intervention, primarily because of existing strong incentives. That could be consistent with this post, in which case we’d disagree primarily on whether the “default” governance solutions are sufficient for handling AI risk, where I’m a lot more optimistic than the author.

My quick two-line review is something like: this post (and its sequel) is an artifact from someone with an interesting perspective on the world looking at the whole problem and trying to communicate their practical perspective. I don't really share this perspective, but it is looking at enough of the real things, and differently enough to the other perspectives I hear, that I am personally glad to have engaged with it. +4.

I see CSC and SEM as highly linked via modularity of processes.

This is an excellent post, thank you a lot for it. Below is an assortment of remarks and questions.

The table is interesting. Here's my attempt at estimating the values for helpfulness, educational value and 2022 neglect (epistemic effort: I made this up on the spot from intuitions):

OOD robustness: 2/10, 4/10, 2/10
Agent foundations: 5/10, 7/10, 9/10
Multi-agent RL: 1/10, 4/10, 2/10
Preference learning: 8/10, 7/10, 2/10
Side-effect minimization: 7/10, 5/10, 4/10
Human-robot interaction: 1/10, 1/10, 2/10
Interpretability: 9/10, 8/10, 3/10
Fairness in ML: 2/10, 4/10, 1/10
Computational social choice: 3/10, 5/10, 2/10
Accountability in ML: 6/10, 2/10, 3/10

Furthermore:

Contributions to preference learning are not particularly helpful to existential safety in my opinion, because their most likely use case is for modeling human consumers just well enough to create products they want to use and/or advertisements they want to click on. Such advancements will be helpful to rolling out usable tech products and platforms more quickly, but not particularly helpful to existential safety.*
*) I hope no one will be too offended by this view. I did have some trepidation about expressing it on the “alignment” forum, but I think I should voice these concerns anyway, for the following reason.

While I don't put a lot of probability on this view, I'm glad it was expressed here.

One might want to make a distinction between different layers in the Do-What-I-Mean hierarchy; I believe this text is mostly talking about the first/second layer of the hierarchy (excluding "zero DWIMness"). Perhaps there could be additional risks from companies having a richer understanding of human preferences and their distinction from biases?

However, the need to address multilateral externalities will arise very quickly after unilateral externalities are addressed well enough to roll out legally admissible products, because most of our legal systems have an easier time defining and punishing negative outcomes that have a responsible party. I don’t believe this is a quirk of human legal systems: when two imperfectly aligned agents interact, they complexify each other’s environment in a way that consumes more cognitive resources than interacting with a non-agentic environment.

I have a hard time coming up with an example that distinguishes unilateral externalities from multilateral externalities. What would be an example? The ARCHES paper doesn't contain the string "multilateral externalit".

Thinking out loud: If we have three countries A, B and C, all of which have a coast to the same ocean, and A and B release two different chemicals N and M that are harmless each on their own, but in combination create a toxic compound N₂M₃ that harms citizens of all three countries equally, is that a multilateral externality?

Trying to generalise, is any situation where either two or more actors harm other actors, or one or more actors harm two or more other actors, through polluting a shared resource, a multilateral externality?

Two more (small) questions:

Is "translucent game theory" the same as "open-source game theory"?
You say you prefer talking about papers, but do you by chance have a recommendation for CSC textbooks? The handbook you link doesn't have exercises, if I remember correctly.

For the preference learning skepticism, does this extend to the research direction (that isn't yet a research area) of modelling long term preferences/preferences on reflection? This is more along the lines of the "AI-assisted deliberation" direction from ARCHES.

To me it seems like AI alignment that can capture preferences on reflection could be used to find solutions to many other problems. Though there are good reasons to expect that we'd still want to do other work (because we might need theoretical understanding and okay solutions before AI reaches the point where it can help on research, because we want to do work ourselves to be able to check solutions that AIs reach, etc.)

It also seems like areas like FairML and Computational Social Choice will require preference learning as components - my guess is that people's exact preferences about fairness won't have a simple mathematical formulation, and will instead need to be learned. I could buy the position that the necessary progress in preference learning will happen by default because of other incentives.

Nice post! In particular, I like your reasoning about picking research topics:

The main way I can see present-day technical research benefiting existential safety is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise over the next 10-30 years. In short, there often needs to be some amount of traction on a technical area before it’s politically viable for governing bodies to demand that institutions apply and improve upon solutions in those areas.

I like this as a guiding principle, and have used it myself, though my choices have also been driven in part by more open-ended scientific curiosity. But when I apply the above principle, I get to quite different conclusions about recommended research areas.

As a specific example, take the problem of oversight of companies that want to create or deploy strong AI: the problem of getting to a place where society has accepted and implemented policy proposals that demand significant levels of oversight for such companies. In theory, such policy proposals might be held back by a lack of traction in a particular technical area, but I do not believe this is a significant factor in this case.

To illustrate, here are some oversight measures that apply right now to companies that create medical equipment, including diagnostic equipment that contains AI algorithms. (Detail: some years ago I used to work in such a company.) If the company wants to release any such medical technology to the public, it has to comply with a whole range of requirements about documenting all steps taken in development and quality assurance. A significant paper trail has to be created, which is subject to auditing by the regulator. The regulator can block market entry if the processes are not considered good enough. Exactly the same paper trail + auditing measures could be applied to companies that develop powerful non-medical AI systems that interact with the public. No technical innovation would be necessary to implement such measures.

So if any activist group or politician wants to propose measures to improve oversight of AI development and use by companies (either motivated by existential safety risks or by a more general desire to create better outcomes in society), there is no need for them to wait for further advances in Interpretability in ML (IntML), Fairness in ML (FairML) or Accountability in ML (AccML) techniques.

To lower existential risks from AI, it is absolutely necessary to locate proposals for solutions which are technically tractable. But to find such solutions, one must also look at low-tech and different-tech solutions that go beyond the application of even more AI research. The existence of tractable alternative solutions to make massive progress leads me to down-rank the three AI research areas I mention above, at least when considered from a pure existential safety perspective. The non-existence of alternatives also leads me to up-rank other areas (like corrigibility) which are not even mentioned in the original post.

I like the idea of recommending certain fields for their educational value to existential-safety-motivated researchers. However, I would also recommend that such researchers read broadly beyond the CS field, to read about how other high-risk fields are managing (or have failed to manage) to solve their safety and governance problems.

I believe that the most promising research approach for lowering AGI safety risk is to find solutions that combine AI research specific mechanisms with more general mechanisms from other fields, like the use of certain processes which are run by humans.

I've highly voted this post for a few reasons.

First, this post contains a bunch of other individual ideas I've found quite helpful for orienting. Some examples:

Useful thoughts on which term definitions have "staying power," and are worth coordinating around.
The zero/single/multi alignment framework.
The details on how to anticipate, legitimize and fulfill governance demands.

But my primary reason was learning Critch's views on what research fields are promising, and how they fit into his worldview. I'm not sure if I agree with Critch, but I think "Figure out what are the best research directions to navigate towards" seems crucially important. It helps to have senior AI x-risk researchers lay out how they think about what research is valuable.

I'd like to see similar posts from Paul, Eliezer, etc, (which I expect to have radically different frames). I don't expect everyone to end up converging on a single worldview, but I think the process of smashing the worldviews together can generate useful ideas, and give up-and-coming-researchers some hooks of what to explore.

One confusing thing here is that the initial table doesn't distinguish between "fields that aren't that helpful for existential safety" and "fields which are both helpful-and-harmful to existential safety." I was surprised when I looked at the initial Agent Foundations ranking of "3" which turned out to be much more complex.

Some notes on worldview differences this post highlights.

disclaimer: my own rough guesses about Critch's and MIRI's views, which may not be accurate. It's also focusing on the differences that felt important to me, which I think are somewhat different from how Critch presents things. I'm also using "MIRI" as sort of a shorthand for "some cluster of thinking that's common on LW", which isn't necessaril

My understanding of Critch's paradigm seems fairly different from the MIRI paradigm (which AFAICT expects the first AGI mover will gain an overwhelming decisive advantage, and meanwhile that interfacing with most existing power structures is... kinda a waste of time, due to them being trapped in bad equilibria that make them inadequate?).

From what I understand of Critch's view, AGI will tend to be rolled out in smaller, less-initially-powerful pieces, and much of the danger of AGI comes from when many different AGIs start interacting with each other, and multiple humans, in ways that get increasingly hard to predict.

Therefore, it's important for humanity as a whole to be able to think critically and govern themselves in scalable ways. I think Critch thinks it is both more tractable and more important to get humanity to collectively govern itself, which leads to more emphasis on domains like ML Fairness.

Some followup work I'd really like to see is more public discussion about the underlying worldview differences here, and the actual cruxes that generate them.

Speaking for myself (as opposed to either Critch or MIRI-esque researchers), "whether our institutions are capable of governing themselves in the face of powerful AI systems" is an important crux for what strategic directions to prioritize. BUT, I've found all the gears that Critch has pointed to here to be helpful for my overall modeling of the world.

One thing I'd like to see is some more fleshed out examples of the kinds of governance demands that you think might be important in the future and would be bottlenecked on research progress in these areas.

Progress in OODR will mostly be used to help roll out more AI technologies into active deployment more quickly

Research in this area usually does not involve deep or lengthy reflections about the structure of society and human values and interactions, which I think makes this field sort of collectively blind to the consequences of the technologies it will help build.

Incidentally, the main reason I think OODR research is educationally valuable is that it can eventually help with applying agent foundations research to societal-scale safety. Specifically: how can we know if one of the operations (a)-(f) above is safe to perform 1,000,000 times, given that it was safe the first 1,000 times we applied it in a controlled setting, but the setting is changing over time? This is a special case of an OODR question.
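To give a rough sense of how weak the evidence from 1,000 safe trials is, here is a small worked calculation. It is only a sketch under an assumption the quoted passage explicitly does not grant: that every application is an independent trial with the same failure probability (no distribution shift). The function name and the 95% confidence level are illustrative choices, not something taken from the post.

```python
def upper_bound_failure_rate(n_safe_trials, confidence=0.95):
    """Largest per-trial failure probability p that is still consistent,
    at the given confidence level, with seeing zero failures in n trials.

    Solves (1 - p) ** n = 1 - confidence for p.
    Assumes independent, identically distributed trials (hypothetical).
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_safe_trials)


# 1,000 safe applications in a controlled setting.
p_max = upper_bound_failure_rate(1_000)

# If the true failure rate were at that bound, this is how many failures
# we would expect across 1,000,000 applications.
expected_failures = p_max * 1_000_000

print(f"failure rate could still be as high as {p_max:.4%}")
print(f"expected failures in 1,000,000 uses at that rate: {expected_failures:.0f}")
```

Even in this optimistic fixed-setting case, 1,000 clean trials only bound the failure rate at roughly 3 in 1,000, which is compatible with thousands of failures over a million uses. A setting that changes over time can only make the guarantee weaker, which is where the OODR question comes in.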

In any case, I see AI alignment in turn as having two main potential applications to existential safety:

(1) AI alignment is useful as a metaphor for thinking about how to align the global effects of AI technology with human existence, a major concern for AI governance at a global scale, and
(2) AI alignment solutions could be used directly to govern powerful AI technologies designed specifically to make the world safer.

(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.

It sounds like you may be assuming that people will roll out a technology when its reliability meets a certain level X, so that raising the reliability of AI systems has little or no effect on the reliability of deployed systems (namely it will just be X).

Yes, this is more or less my assumption. I think slower progress on OODR will delay release dates of transformative tech much more than it will improve quality/safety on the eventual date of release.

A more plausible model is that deployment decisions will be based on many axes of quality, e.g. suppose you deploy when the sum of reliability and speed reaches some threshold Y. If that's the case, then raising reliability will improve the reliability and decrease the speed of deployed systems. If you think that increasing the reliability of AI systems is good (e.g. because AI developers want their AI systems to have various socially desirable properties and are limited by their ability to robustly achieve those properties) then this would be good.
I'm not clear on what part of that picture you disagree with or if you think that this is just small relative to some other risks.
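The difference between the two deployment models in this exchange can be made concrete with a toy calculation. This is a minimal sketch and not anything proposed in the post or the comments: it assumes, purely for illustration, that reliability and speed each grow linearly with research time, and the function names, rates and thresholds are all hypothetical.

```python
def deploy_fixed_bar(rel_rate, bar):
    """Model 1: deploy as soon as reliability reaches a fixed bar X.

    Assumes reliability grows linearly, r(t) = rel_rate * t (hypothetical).
    Returns (deployment time, reliability at deployment).
    """
    t = bar / rel_rate
    return t, bar  # reliability at deployment is X by construction


def deploy_sum_threshold(rel_rate, speed_rate, threshold):
    """Model 2: deploy when reliability + speed reaches a threshold Y.

    Assumes r(t) = rel_rate * t and s(t) = speed_rate * t (hypothetical).
    Returns (deployment time, reliability, speed) at deployment.
    """
    t = threshold / (rel_rate + speed_rate)
    return t, rel_rate * t, speed_rate * t


# Baseline versus three times faster reliability research.
print(deploy_fixed_bar(rel_rate=1.0, bar=6.0))       # slow reliability progress
print(deploy_fixed_bar(rel_rate=3.0, bar=6.0))       # fast reliability progress
print(deploy_sum_threshold(1.0, 1.0, threshold=10.0))
print(deploy_sum_threshold(3.0, 1.0, threshold=10.0))
```

Under the fixed-bar model, faster reliability research only moves the release date earlier (from t = 6 to t = 2 here) while deployed reliability stays at X. Under the sum-threshold model, the same speed-up raises deployed reliability (5 to 7.5) and lowers deployed speed (5 to 2.5), which is the effect described in the comment above.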

I am similarly frustrated by this claim:

The European General Data Protection Regulation (GDPR) is a very good step for regulating how tech companies relate with the public. I say this knowing that GDPR is far from perfect. The reason it's still extremely valuable is that it has initialized the variable defining humanity's collective bargaining position (at least within Europe...) for controlling how tech companies use data.

In summary:

I hope that technical AI x-risk/existential safety researchers focus on legitimizing and fulfilling those governance and accountability demands that are in fact legitimate.
I hope that discussion of AI governance and accountability does not inhabit a frame in which demands for governance and accountability are reliably legitimate.

That said, while I disagree with your analysis, I do agree with your final position:

I hope that technical AI x-risk/existential safety researchers focus on legitimizing and fulfilling those governance and accountability demands that are in fact legitimate.
I hope that discussion of AI governance and accountability does not inhabit a frame in which demands for governance and accountability are reliably legitimate.

If single/single alignment is solved it feels like there are some salient "default" ways in which we'll end up approaching multi/multi alignment:

Existing single/single alignment techniques can also be applied to empower an organization rather than an individual. So we can use existing social technology to form firms and governments and so on, and those organizations will use AI.
AI systems can themselves participate in traditional social institutions. So AI systems that represent individual human interests can interact with each other e.g. in markets or democracies.

For example, let's take the considerations you discuss under CSC:

Third, unless humanity collectively works very hard to maintain a degree of simplicity and legibility in the overall structure of society*, this “alignment revolution” will greatly complexify our environment to a point of much greater incomprehensibility and illegibility than even today’s world. This, in turn, will impoverish humanity’s collective ability to keep abreast of important international developments, as well as our ability to hold the international economy accountable for maintaining our happiness and existence.

Fourth, in such a world, algorithms will be needed to hold the aggregate global behavior of algorithms accountable to human wellbeing, because things will be happening too quickly for humans to monitor. In short, an “algorithmic government” will be needed to govern “algorithmic society”. Some might argue this is not strictly necessary: in the absence of a mathematically codified algorithmic social contract, humans could in principle coordinate to cease or slow down the use of these powerful new alignment technologies, in order to give ourselves more time to adjust to and govern their use. However, for all our successes in innovating laws and governments, I do not believe current human legal norms are quite developed enough to stably manage a global economy empowered with individually-alignable transformative AI capabilities.

Again, it's not clear what you expect to happen when existing institutions are empowered by AI and mostly coordinate the activities of AI.

> Third, unless humanity collectively works very hard to maintain a degree of simplicity and legibility in the overall structure of society*, this “alignment revolution” will greatly complexify our environment to a point of much greater incomprehensibility and illegibility than even today’s world. This, in turn, will impoverish humanity’s collective ability to keep abreast of important international developments, as well as our ability to hold the international economy accountable for maintaining our happiness and existence.

One approach to this problem is to work to make it more likely that AI systems can adequately represent human interests in understanding and intervening on the structure of society. But this seems to be a single/single alignment problem (to whatever extent that existing humans currently try to maintain and influence our social structure, such that impairing their ability to do so is problematic at all) which you aren't excited about.

Yes, you've correctly anticipated my view on this. Thanks for the very thoughtful reading!

https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic

That said, I remain interested in more clarity on what you see as the biggest risks with these multi/multi approaches that could be addressed with technical research.

(i) make direct improvements in the relevant institutions, in a way that anticipates the changes brought about by AI but will most likely not look like AI research,

It's always possible to say, solving the single/single alignment problem will prevent anything like that from happening in the first place, but why put all your hopes on plan A, when plan B is relatively neglected?

It seems premature to say, in advance of actually seeing what such research uncovers, whether the relevant mechanisms and governance improvements are exactly the same as the improvements we need for good governance generally, or different.

Outcome C: Both atomic and social alignment keep up with AI capability. Transformative AI is aligned with society/humanity as a whole, resulting in a benevolent future for everyone.

The "default" pace of progress in each dimension: e.g. if we assume atomic AI alignment will be solved in time anyway, then we should focus on social AI alignment.
The inherent difficulty of each dimension: e.g. if we assume atomic AI alignment is relatively hard (and will therefore take a long time to solve) whereas social AI alignment becomes relatively easy once atomic AI alignment is solved, then we should focus on atomic AI alignment.
The extent to which each dimension depends on others: e.g. if we assume it's impossible to make progress in social AI alignment without reaching some milestone in atomic AI alignment, then we should focus on atomic AI alignment for now. Similarly, some argued we shouldn't work on alignment at all before making more progress in capability.
More precisely, the last two can be modeled jointly as the cost of marginal progress in a given dimension as a function of total progress in all dimensions.
The extent to which outcome B is bad for people not in the elite: If it's not too bad then it's more important to prevent outcome A by focusing on atomic AI alignment, and vice versa.
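
One way to write down the joint model mentioned in the list above, as a sketch only. The symbols $x$ and $c_i$ and the three-way indexing are notation introduced here for illustration; they are not defined anywhere in the post or the comments.

```latex
% x = (x_cap, x_atom, x_soc): total progress so far in capability,
% atomic alignment, and social alignment (hypothetical notation).
\[
  c_i(x) \;=\; \text{cost of one more unit of progress in dimension } i,
  \qquad i \in \{\mathrm{cap}, \mathrm{atom}, \mathrm{soc}\}.
\]
% Inherent difficulty: how large c_i is overall.
% Dependence between dimensions: how c_i changes with progress elsewhere, e.g.
\[
  \frac{\partial c_{\mathrm{soc}}}{\partial x_{\mathrm{atom}}} < 0
\]
% would say that progress on atomic alignment makes social alignment cheaper.
```

Under this reading, the claim that social alignment "becomes relatively easy once atomic AI alignment is solved" corresponds to $c_{\mathrm{soc}}$ being small whenever $x_{\mathrm{atom}}$ is large.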

Outcome B: Progress in atomic AI alignment keeps up with progress in AI capability, but progress in social AI alignment doesn't keep up. Transformative AI is aligned with a small fraction of the population, resulting in this minority gaining absolute power and abusing it to create an extremely inegalitarian future. Wars between different factions are also a concern.

It's unclear to me how this particular outcome relates to social alignment (or at least to the kinds of research areas in this post). Some possibilities:

Does failure to solve social alignment mean that firms and governments cannot use AI to represent their shareholders and constituents? Why might that be? (E.g. what's a plausible approach to atomic alignment that couldn't be used by a firm or government?)
Does AI progress occur unevenly such that some group gets much more power/profit, and then uses that power? If so, how would technical progress on alignment help address that outcome? (Why would the group with power be inclined to use whatever techniques we're imagining?) Also, why does this happen?
Does AI progress somehow complicate the problem of governance or corporate governance such that those organizations can no longer represent their constituents/shareholders? What is the mechanism (or any mechanism) by which this happens? Does social alignment help by making new forms of organization possible, and if so should I just be thinking of it as a way of improving those institutions, or is it somehow distinctive?
Do we already believe that the situation is gravely unequal (e.g. because governments can't effectively represent their constituents and most people don't have a meaningful amount of capital) and AI progress will exacerbate that situation? How does social alignment prevent that?

If the number of AIs is small, the AI interface becomes a single point of failure: an actor that can hijack the interface will have enormous power.
If the number of AIs is small, it might be unclear what inputs should be fed into the AI in order to fairly represent the collective. It requires "manually" solving the preference aggregation problem, and faults of the solution might be amplified by the powerful optimization to which it is subjected.
If the number of AIs is more than one then we should make sure the AIs are good at cooperating, which requires research about multi-AI scenarios.
If the number of AIs is large (e.g. one per person), we need the interface to be sufficiently robust that people can use it correctly without special training. Also, this might be prohibitively expensive.

My sense is that the OP may be more concerned about failures in which no one gets what they want rather than outcome B per se

Well, the OP did say:

(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.

I understood it as hinting at outcome B, but I might be wrong.

Outcome C is most naturally achieved using "direct democracy" TAI, i.e. one that collects inputs from everyone and aggregates them in a reasonable way. We can try emulating democratic AI via single-user AI, but that's hard because:

I'm not sure what's most natural, but I do consider this a fairly unlikely way of achieving outcome C.

In the first scenario, to get outcome C we need the singleton to either be democratic by design, or have a very sophisticated and robust system of controlling access to it.

The OP's conclusion seems to be that social AI alignment should be the main focus. Personally, I'm less convinced. It would be interesting to see more detailed arguments about the above parameters that support or refute this thesis.

https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic

I'd be curious to see if it convinces you that what you call "social alignment" should be our main focus, or at least a much greater focus than currently.

with the exception of people who decided to gamble on being part of the elite in outcome B

Therefore, the AI race game is not all or nothing. The more win probability lands on parties that can bargain properly, the less multiversal utility is burned.

I don't think making acausal trade work is that hard. All that is required is:

That the winner cares about the counterfactual versions of themselves that didn't win, or equivalently, is unsure whether they're being simulated by another winner. (Huh, one could actually impact this through memetic work today, though messing with people's preferences like that doesn't sound friendly.)
That they think to simulate alternate winners before they expand too far to be simulated.

Tech companies will be incentivized to deploy AI with slipshod alignment, which might then take actions that no one wants and which pose existential risk. (Concretely, I'm thinking of out with a whimper and out with a bang scenarios.) But the existence of better alignment techniques might legitimize governance demands, i.e. demands that tech companies don't make products that do things that literally no one wants.
Single/single alignment might be a prerequisite to certain computational social choice solutions. E.g., once we know how to build an agent that "does what [human] wants", we can then build an agent that "helps [human 1] and [human 2] draw up incomplete contracts for mutual benefit subject to the constraints in the [policy] written by [human 3]". And slipshod alignment might not be enough for this application.

It seems like we'll have to put special effort into both single/single alignment and multi/single "alignment", because the free market might not give it to us.

(2) is essentially aiming to take over the world in the name of making it safer, which is not generally considered the kind of thing we should be encouraging lots of people to do.

Before I explain why I think this should be useful even on your views, let me try clarifying the field more. Looking at the papers you list as exemplars of PL:

(2017) Deep reinforcement learning from human preferences, Christiano, Paul F; Leike, Jan; Brown, Tom; Martic, Miljan; Legg, Shane; Amodei, Dario.
(2018) Reward learning from human preferences and demonstrations in Atari, Ibarz, Borja; Leike, Jan; Pohlen, Tobias; Irving, Geoffrey; Legg, Shane; Amodei, Dario.
(2018) The alignment problem for Bayesian history-based reinforcement learners, Everitt, Tom; Hutter, Marcus.
(2019) Learning human objectives by evaluating hypothetical behavior, Reddy, Siddharth; Dragan, Anca D; Levine, Sergey; Legg, Shane; Leike, Jan.
(2019) On the feasibility of learning, rather than assuming, human biases for reward inference, Shah, Rohin; Gundotra, Noah; Abbeel, Pieter; Dragan, Anca D.
(2020) Reward-rational (implicit) choice: A unifying formalism for reward learning, Jeon, Hong Jun; Milli, Smitha; Dragan, Anca D.

A number of blogs seem to treat [AI existential safety, AI alignment, and AI safety] as near-synonyms (e.g., LessWrong, the Alignment Forum), and I think that is a mistake, at least when it comes to guiding technical work for existential safety.

I strongly agree with the benefits of having separate terms and generally like your definitions.

In this post, AI existential safety means “preventing AI technology from posing risks to humanity that are comparable or greater than human extinction in terms of their moral significance.”

Good to hear!

If I read that term ["AI existential safety"] without a definition I would assume it meant "reducing the existential risk posed by AI." Hopefully you'd be OK with that reading. I'm not sure if you are trying to subtly distinguish it from Nick's definition of existential risk or if the definition you give is just intended to be somewhere in that space of what people mean when they say "existential risk" (e.g. the LW definition is like yours).

Curated, for several reasons.

Thanks for this long and very detailed post!

The MARL projects with the greatest potential to help are probably those that find ways to achieve cooperation between decentrally trained agents in a competitive task environment, because of its potential to minimize destructive conflicts between fleets of AI systems that cause collateral damage to humanity. That said, even this area of research risks making it easier for fleets of machines to cooperate and/or collude at the exclusion of humans, increasing the risk of humans becoming gradually disenfranchised and perhaps replaced entirely by machines that are better and faster at cooperation than humans.

One approach to this research area is to continually examine social dilemmas through the lens of whatever is the leading AI development paradigm in a given year or decade, and attempt to classify interesting behaviors as they emerge. This approach might be viewed as analogous to developing “transparency for multi-agent systems”: first develop interesting multi-agent systems, and then try to understand them.

OpenAI just released a paper on RL agents in social dilemmas, https://arxiv.org/pdf/2011.05373v1.pdf, and there is some previous work. This is more directly multiagent RL, but there is some consideration for things like choosing the right overall social welfare metric.
There are also two papers examining bandit algorithms in iterated voting scenarios, https://hal.archives-ouvertes.fr/hal-02641165/document and https://www.irit.fr/~Umberto.Grandi/scone/Layka.m2.pdf.

My current research is attempting to build on the second of these.

In short, computational social choice research will be necessary to legitimize and fulfill governance demands for technology companies (automated and human-run companies alike) to ensure AI technologies are beneficial to and controllable by human society.
...

CSC neglect:
As mentioned above, I think CSC is still far from ready to fulfill governance demands at the ever-increasing speed and scale that will be needed to ensure existential safety in the wake of “the alignment revolution”.

Great post!

I suppose you'll be more optimistic about Single/Single areas if you update towards fast/discontinuous takeoff?

The main way I can see present-day technical research benefitting existential safety is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise over the next 10-30 years. In short, there often needs to be some amount of traction on a technical area before it’s politically viable for governing bodies to demand that institutions apply and improve upon solutions in those areas.

Regarding the two ways you enumerate in which AI alignment could serve to further existential safety, I think a third, more viable, way is missing:

AI alignment solutions allow humans to build powerful AI systems that behave as planned without compromising existential safety.

Planned summary for the Alignment Newsletter:

This long post explains the author’s beliefs about a variety of research topics relevant to AI existential safety. First, let’s look at some definitions.
While AI safety alone just means getting AI systems to avoid risks (including e.g. the risk of a self-driving car crashing), _AI existential safety_ means preventing AI systems from posing risks at least as bad as human extinction. _AI alignment_ on the other hand is about getting an AI system to try to / succeed at doing what a person or institution wants it to do. (The “try” version is _intent alignment_, while the “succeed” version is _impact alignment_.)
Note that AI alignment is not the same thing as AI existential safety. In addition, the author makes the stronger claim that it is insufficient to guarantee AI existential safety, because AI alignment tends to focus on situations involving a single human and a single AI system, whereas AI existential safety requires navigating systems involving multiple humans and multiple AI systems. Just as AI alignment researchers worry that work on AI capabilities for useful systems doesn’t engage enough with the difficulty of alignment, the author worries that work on alignment doesn’t engage enough with the difficulty of multiagent systems.
The author also defines _AI ethics_ as the principles that AI developers and systems should follow, and _AI governance_ as identifying _and enforcing_ norms for AI developers and systems to follow. While ethics research may focus on resolving disagreements, governance will be more focused on finding agreeable principles and putting them into practice.
Let’s now turn to how to achieve AI existential safety. The main mechanism the author sees is to _anticipate, legitimize, and fulfill governance demands_ for AI technology. Roughly, governance demands are those properties which there are social and political pressures for, such as “AI systems should be fair” or “AI systems should not lead to human extinction”. If we can _anticipate_ these demands in advance, then we can do technical work on how to _fulfill_ or meet these demands, which in turn _legitimizes_ them, that is, it makes it clearer that the demand can be fulfilled and so makes it easier to create common knowledge that it is likely to become a legal or professional standard.
We then turn to various different fields of research, which the author ranks on three axes: helpfulness to AI existential safety (including potential negative effects), educational value, and neglectedness. Note that for educational value, the author is estimating the benefits of conducting research on the topic _to the researcher_, and not to (say) the rest of the field. I’ll only focus on helpfulness to AI existential safety below, since that’s what I’m most interested in (it’s where the most disagreement is, and so where new arguments are most useful), but I do think all three axes are important.
The author ranks both preference learning and out of distribution robustness lowest on helpfulness to existential safety (1/10), primarily because companies already have a strong incentive to have robust AI systems that understand preferences.
Multiagent reinforcement learning (MARL) comes only slightly higher (2/10), because, since it doesn’t involve humans, its main purpose seems to be to deploy fleets of agents that may pose risks to humanity. It is possible that MARL research could help by producing <@cooperative agents@>(@Cooperative AI Workshop@), but even this carries its own risks.
Agent foundations is especially dual-use in this framing, because it can help us understand the big multiagent system of interactions, and there isn’t a restriction on how that understanding could be used. It consequently gets a low score (3/10), which is a combination of “targeted applications could be very useful” and “it could lead to powerful harmful forces”.
Minimizing side effects starts to address the challenges the author sees as important (4/10): in particular, it can allow us both to prevent accidents, where an AI system “messes up”, and it can help us prevent externalities (harms to people other than the primary stakeholders), which are one of the most challenging issues in regulating multiagent systems.
Fairness is valuable for the obvious reason: it is a particular governance demand that we have anticipated, and research on it now will help fulfill and legitimize that demand. In addition, research on fairness helps get people to think at a societal scale, and to think about the context in which AI systems are deployed. It may also help prevent centralization of power from deployment of AI systems, since that would be an unfair outcome.
The author would love it if AI/ML pivoted to frequently think about real-life humans and their desires, values and vulnerabilities. Human-robot interaction (HRI) is a great way to cause more of that to happen, and that alone is valuable enough that the author assigns it 6/10, tying it with fairness.
As we deploy more and more powerful AI systems, things will eventually happen too quickly for humans to monitor. As a result, we will need to also automate the process of governance itself. The area of computational social choice is well-poised to make this happen (7/10), though certainly current proposals are insufficient and more research is needed.
Accountability in ML is good (8/10) primarily because as we make ML systems accountable, we will likely also start to make tech companies accountable, which seems important for governance. In addition, in a <@CAIS@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@) scenario, better accountability mechanisms seem likely to help in ensuring that the various AI systems remain accountable, and thus safer, to human society.
Finally, interpretability is useful (8/10) for the obvious reasons: it allows developers to more accurately judge the properties of systems they build, and helps in holding developers and systems accountable. But the most important reason may be that interpretable systems can make it significantly easier for competing institutions and nations to establish cooperation around AI-heavy operations.

Planned opinion:

I liked this post: it’s a good exploration of what you might do if your goal was to work on technical approaches to future governance challenges; that seems valuable and I broadly agree with it (though I did have some nitpicks in [this comment](https://www.alignmentforum.org/posts/hvGoYXi2kgnS3vxqb/some-ai-research-areas-and-their-relevance-to-existential-1?commentId=LjvvW3xddPTXaequB)).
There is then an additional question of whether the best thing to do to improve AI existential safety is to work on technical approaches to governance challenges. There’s some pushback on this claim in the comments that I agree with; I recommend reading through it. It seems like the core disagreement is on the relative importance of risks: in particular, it sounds like the author thinks that existing incentives for preference learning and out-of-distribution robustness are strong enough that we mostly don’t have to worry about it, whereas governance will be much more challenging; I disagree with at least that relative ranking.
It’s possible that we agree on the strength of existing incentives -- I’ve <@claimed@>(@Conversation with Rohin Shah@) a risk of 10% for existentially bad failures of intent alignment if there is no longtermist intervention, primarily because of existing strong incentives. That could be consistent with this post, in which case we’d disagree primarily on whether the “default” governance solutions are sufficient for handling AI risk, where I’m a lot more optimistic than the author.

I see CSC and SEM as highly linked via modularity of processes.

This is an excellent post, thank you a lot for it. Below is an assortment of remarks and questions.

The table is interesting. Here's my attempt at estimating the values for helpfulness, educational value and 2022 neglect (epistemic effort: I made this up on the spot from intuitions):

OOD robustness: 2/10, 4/10, 2/10
Agent foundations: 5/10, 7/10, 9/10
Multi-agent RL: 1/10, 4/10, 2/10
Preference learning: 8/10, 7/10, 2/10
Side-effect minimization: 7/10, 5/10, 4/10
Human-robot interaction: 1/10, 1/10, 2/10
Interpretability: 9/10, 8/10, 3/10
Fairness in ML: 2/10, 4/10, 1/10
Computational social choice: 3/10, 5/10, 2/10
Accountability in ML: 6/10, 2/10, 3/10

Furthermore:

Contributions to preference learning are not particularly helpful to existential safety in my opinion, because their most likely use case is for modeling human consumers just well enough to create products they want to use and/or advertisements they want to click on. Such advancements will be helpful to rolling out usable tech products and platforms more quickly, but not particularly helpful to existential safety.*
*) I hope no one will be too offended by this view. I did have some trepidation about expressing it on the “alignment” forum, but I think I should voice these concerns anyway, for the following reason.

While I don't put a lot of probability on this view, I'm glad it was expressed here.

However, the need to address multilateral externalities will arise very quickly after unilateral externalities are addressed well enough to roll out legally admissible products, because most of our legal systems have an easier time defining and punishing negative outcomes that have a responsible party. I don’t believe this is a quirk of human legal systems: when two imperfectly aligned agents interact, they complexify each other’s environment in a way that consumes more cognitive resources than interacting with a non-agentic environment.

Two more (small) questions:

Is "translucent game theory" the same as "open-source game theory"?
You say you prefer talking about papers, but do you by chance have a recommendation for CSC textbooks? The handbook you link doesn't have exercises, if I remember correctly.

Nice post! In particular, I like your reasoning about picking research topics:

The main way I can see present-day technical research benefiting existential safety is by anticipating, legitimizing and fulfilling governance demands for AI technology that will arise over the next 10-30 years. In short, there often needs to be some amount of traction on a technical area before it’s politically viable for governing bodies to demand that institutions apply and improve upon solutions in those areas.

I've highly voted this post for a few reasons.

First, this post contains a bunch of other individual ideas I've found quite helpful for orienting. Some examples:

Useful thoughts on which term definitions have "staying power," and are worth coordinating around.
The zero/single/multi alignment framework.
The details on how to anticipate, legitimize, and fulfill governance demands.

Some notes on worldview differences this post highlights.

Some follow-up work I'd really like to see is more public discussion of the underlying worldview differences here, and the actual cruxes that generate them.
