This post is part of the sequence version of the Effective Altruism Foundation's research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence.

5 Contemporary AI architectures

Although the architectures of TAI systems will likely be quite different to existing ones, it may still be possible to gain some understanding of cooperation failures among such systems using contemporary tools [1]. First, it is plausible that some aspects of contemporary deep learning methods will persist in TAI systems, making experiments done today directly relevant. Second, even if this is not the case, such research may still help by laying the groundwork for the study of cooperation failures in more advanced systems.

5.1 Learning to solve social dilemmas

As mentioned above, some attention has recently been devoted to social dilemmas among deep reinforcement learners (Leibo et al., 2017; Peysakhovich and Lerer, 2017; Lerer and Peysakhovich, 2017; Foerster et al., 2018; Wang et al., 2018). However, a fully general, scalable but theoretically principled approach to achieving cooperation among deep reinforcement learning agents is lacking. In Example 5.1.1 we sketch a general approach to cooperation in general-sum games which subsumes several recent methods, and afterwards list some research questions raised by the framework.


Example 5.1.1 (Sketch of a framework for cooperation in social dilemmas).

The setting is a 2-agent decision process. At each timestep $t$, each agent $i$ receives an observation $o^t_i$; takes an action $a^t_i = \pi_i(o^t_i)$ based on their policy $\pi_i$ (assumed to be deterministic for simplicity); and receives a reward $r^t_i$. Player $i$ expects to get a value of $V_i(\pi_1, \pi_2) = \mathbb{E}\left[\sum_t \gamma^t r^t_i\right]$ if the policies $(\pi_1, \pi_2)$ are deployed. Examples of such environments which are amenable to study with contemporary machine learning tools are the "sequential social dilemmas'' introduced by Leibo et al. (2017). These include a game involving potential conflict over scarce resources, as well as a coordination game similar in spirit to Stag Hunt (Table 1).

Suppose that the agents (or their overseers) have the opportunity to choose what policies to deploy by simulating from a model, and to bargain over the choice of policies. The idea is for the parties to arrive at a welfare function $w$ which they agree to jointly maximize; deviations from the policies which maximize the welfare function will be punished if detected. Let $d_i$ be a "disagreement point'' measuring how well agent $i$ expects to do if they deviate from the welfare-maximizing policy profile. This could be their security value or an estimate of their value when the agents use independent learning algorithms. Finally, define player $i$'s ideal point $u^*_i = \max_{\pi_1, \pi_2} V_i(\pi_1, \pi_2)$, the best value they could obtain under any joint policy. Table 5 displays welfare functions corresponding to several widely-discussed bargaining solutions, adapted to the multi-agent reinforcement learning setting.
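Table 5 is not reproduced in this post. As a rough illustration only (our own sketch, not the agenda's table), welfare functions corresponding to three widely discussed bargaining solutions could be written as follows in terms of the values $V_i$, disagreement points $d_i$, and ideal points $u^*_i$ defined above:

```python
# Illustrative welfare functions for three standard bargaining solutions.
# V, d, u_star are pairs (player 1, player 2) of expected values, disagreement
# points, and ideal points. This is a sketch; the exact forms in Table 5 of the
# agenda may differ.

def utilitarian_welfare(V):
    """Sum of the players' expected values."""
    return V[0] + V[1]

def nash_welfare(V, d):
    """Product of gains over the disagreement point (Nash bargaining solution)."""
    return (V[0] - d[0]) * (V[1] - d[1])

def kalai_smorodinsky_welfare(V, d, u_star):
    """Minimum gain normalized by the maximal possible gain; maximizing this
    over a convex feasible set recovers the Kalai-Smorodinsky solution."""
    return min((V[i] - d[i]) / (u_star[i] - d[i]) for i in range(2))
```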

Define the cooperative policies as $(\pi^C_1, \pi^C_2) = \operatorname{argmax}_{\pi_1, \pi_2} w\big(V_1(\pi_1, \pi_2), V_2(\pi_1, \pi_2)\big)$. We need a way of detecting defections so that we can switch from the cooperative policy to a punishment policy. Call a function that detects defections a "switching rule''. To make the framework general, consider switching rules $s_i$ which return 1 for Switch and 0 for Stay. Rules depend on the agent's observation history $h^t_i$. The contents of $h^t_i$ will differ based on the degree of observability of the environment, as well as how transparent agents are to each other (cf. Table 6). Example switching rules include:

  • Switch when I see that my counterpart doesn't follow the cooperative policy (cf. Lerer and Peysakhovich 2017): $s_i(h^t_i) = 1$ if $a^{t'}_{-i} \neq \pi^C_{-i}(o^{t'}_{-i})$ for some $t' \leq t$;
  • Switch when my rewards indicate my counterpart is not cooperating (Peysakhovich and Lerer, 2017): $s_i(h^t_i) = 1$ if $\sum_{t' \leq t} r^{t'}_i < \alpha^t$ for some thresholds $\alpha^t$;
  • Switch when the probability that my counterpart is cooperating, according to my trained defection-detecting model, is low (cf. Wang et al. 2018): $s_i(h^t_i) = 1$ if $P\big(\text{counterpart cooperates} \mid h^t_i\big) < \beta$ for some threshold $\beta$.

Finally, the agents need punishment policies to switch to in order to disincentivize defections. An extreme case of a punishment policy is the one in which an agent commits to minimizing their counterpart's utility once they have defected: $\pi^P_i = \operatorname{argmin}_{\pi_i} \max_{\pi_{-i}} V_{-i}(\pi_i, \pi_{-i})$. This is the generalization of the so-called "grim trigger'' strategy underlying the classical theory of iterated games (Friedman, 1971; Axelrod, 2000). It can be seen that each player submitting a grim trigger strategy in the above framework constitutes a Nash equilibrium in the case that the counterpart's observations and actions are visible (and therefore defections can be detected with certainty). However, grim trigger is intuitively an extremely dangerous strategy for promoting cooperation, and indeed does poorly in empirical studies of different strategies for the iterated Prisoner's Dilemma (Axelrod and Hamilton, 1981). One possibility is to train more forgiving, tit-for-tat-like punishment policies, and play a mixed strategy when choosing which to deploy in order to reduce exploitability.
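To make the moving parts of Example 5.1.1 concrete, here is a minimal sketch of the "cooperate, monitor, punish" wrapper in Python. The class and function names, the observation format, and the permanent switch are our own illustrative assumptions, not a reference implementation from the agenda:

```python
# Sketch of an agent that plays the cooperative policy, monitors its history
# with a switching rule s_i, and switches to a punishment policy pi^P_i once a
# defection is detected. Policies are callables mapping observations to actions.

class ConditionallyCooperativeAgent:
    def __init__(self, cooperative_policy, punishment_policy, switching_rule):
        self.cooperative_policy = cooperative_policy  # pi^C_i, trained to maximize the welfare w
        self.punishment_policy = punishment_policy    # pi^P_i, e.g. grim trigger or a more forgiving policy
        self.switching_rule = switching_rule          # s_i: observation history -> {0, 1}
        self.history = []
        self.punishing = False

    def act(self, observation):
        self.history.append(observation)
        # Once the switching rule fires, switch (here: permanently) to punishment.
        if not self.punishing and self.switching_rule(self.history):
            self.punishing = True
        policy = self.punishment_policy if self.punishing else self.cooperative_policy
        return policy(observation)


# Example switching rule in the spirit of Peysakhovich and Lerer (2017): switch
# when cumulative reward falls below a (time-dependent) threshold expected under
# mutual cooperation. Assumes each observation is a dict containing a "reward" key.
def make_reward_threshold_rule(threshold_fn):
    def rule(history):
        rewards = [obs["reward"] for obs in history if "reward" in obs]
        return sum(rewards) < threshold_fn(len(rewards))
    return rule
```

A more forgiving variant could reset the punishment flag after a fixed number of punishment steps, in the spirit of tit-for-tat.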


Some questions facing a framework for solving social dilemmas among deep reinforcement learners, such as that sketched in Example 5.1.1, include:

  • How does the ability of agents to cooperate deteriorate as their ability to observe one another's actions is reduced?
  • Methods for promoting cooperation among deep reinforcement learners such as those discussed in Example 5.1.1 assume 1) complete information (agents do not have private information about, say, their utility functions) and 2) only two players. How can cooperation be achieved in cases of incomplete information and in coalitional games?

In addition to the theoretical development of open-source game theory (Section 3.2), interactions between transparent agents can be studied using tools like deep reinforcement learning. Learning equilibrium (Brafman and Tennenholtz, 2003) and learning with opponent-learning awareness (LOLA) (Foerster et al., 2018; Baumann et al., 2018; Letcher et al., 2018) are examples of analyses of learning under transparency.
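To give a rough sense of what opponent-learning awareness amounts to formally (the notation is ours and this is only a first-order sketch, not the exact update of Foerster et al. (2018)): writing $V_i(\theta_1, \theta_2)$ for agent $i$'s expected value when the policies are parameterized by $\theta_1$ and $\theta_2$, a LOLA-style agent 1 updates against a counterpart which is itself assumed to take a naive gradient step,

$$\Delta \theta_1 = \delta \, \nabla_{\theta_1} V_1\big(\theta_1, \ \theta_2 + \Delta \theta_2\big), \qquad \Delta \theta_2 = \eta \, \nabla_{\theta_2} V_2(\theta_1, \theta_2),$$

where the gradient is taken through the dependence of $\Delta \theta_2$ on $\theta_1$. Expanded to first order, this adds to the naive update a shaping term involving the mixed derivative $\nabla_{\theta_1} \nabla_{\theta_2} V_2$, i.e., a term tracking how agent 1's parameters influence agent 2's learning step. Computing such terms is what requires (partial) access to the counterpart's parameters or gradients, which is why the degree of transparency matters.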

  • "Opponent-aware" methods like Foerster et al. (2018)'s LOLA [2] assume that agents can efficiently verify relevant aspects of one another’s internal workings. How can such verification be achieved in practice? How can agents still reap some of the benefits of transparency in the case of incomplete verifiability? Table 6 lists several recent multi-agent learning techniques which assume varying degrees of agent transparency; given the difficulty of achieving total transparency, successful real-world auditing schemes will likely require a blend of such techniques.
  • How should agents of asymmetric capabilities conduct open-source interactions? (As a simple example, one might consider interactions between a purely model-free agent and one which has access to an accurate world model.)

5.2 Multi-agent training

Multi-agent training is an emerging paradigm for the training of generally intelligent agents (Lanctot et al., 2017; Rabinowitz et al., 2018; Suarez et al., 2019; Leibo et al., 2019). It is as yet unclear what the consequences of such a learning paradigm are for the prospects for cooperativeness among advanced AI systems.

  • Will multi-agent training result in human-like bargaining behaviors, involving for instance the costly punishment of those perceived to be acting unfairly (Henrich et al., 2006)? What are the implications for the relative ability of, say, classical and behavioral game theory [3] to predict the behavior of TAI-enabled systems? And, critically, what are the implications for these agents' ability to implement peaceful bargaining strategies (Section 4)? See especially the literature on behavioral evidence regarding rational crisis bargaining (Quek, 2017; Renshon et al., 2017). See also Section 6.1.
  • One potentially significant disanalogy of multi-agent training with human biological and cultural evolution is the possibility that agents will have (partial) access to one another's internal workings (see Sections 3.2 and 5.1). What can experiments in contemporary ML architectures tell us about the prospects for efficiency gains from open-source multi-agent learning (Section 5.1)?
  • How interpretable will agents trained via multi-agent training be? What are the implications for their ability to make credible commitments (Section 3)?
  • Adversarial training has been proposed as an approach to limiting risks from advanced AI systems (Christiano, 2018d; Uesato et al., 2018). Are risks associated with cooperation failures (such as the propensity to make destructive threats) likely to be found by default adversarial training procedures, or is there a need for the development of specialized techniques?

5.3 Decision theory

Understanding the decision-making procedures implemented by different machine learning algorithms may be critical for assessing how they will behave in high-stakes interactions with humans or other AI agents. One potentially relevant factor is the decision theory implicitly implemented by a machine learning agent. We discuss decision theory at greater length in Section 7.2, but briefly: By an agent’s decision theory, we roughly mean which dependences the agent accounts for when predicting the outcomes of its actions. While it is standard to consider only the causal effects of one’s actions ("causal decision theory’’ (CDT)), there are reasons to think agents should account for non-causal evidence that their actions provide about the world [4]. And, different ways of computing the expected effects of actions may lead to starkly different behavior in multi-agent settings.
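As a toy illustration (ours, not the agenda's) of how the choice of dependences matters, consider a one-shot Prisoner's Dilemma against an exact copy of oneself, with payoffs $R(C, C) = 3$, $R(C, D) = 0$, $R(D, C) = 4$, $R(D, D) = 1$, where one's own action and the copy's agree with probability $0.99$. An agent that conditions on its own action computes

$$\mathrm{EU}(C) = 0.99 \cdot 3 + 0.01 \cdot 0 = 2.97, \qquad \mathrm{EU}(D) = 0.99 \cdot 1 + 0.01 \cdot 4 = 1.03,$$

and cooperates, whereas an agent that treats the copy's action as causally independent of its own (holding any fixed belief $q$ that the copy cooperates) computes $\mathrm{EU}(D) - \mathrm{EU}(C) = 1 > 0$ and defects. Which of these computations a trained system implicitly performs is exactly the kind of question raised below.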

  • Oesterheld (2017a) considers a simple agent designed to maximize the approval score given to it by an overseer (i.e., "approval-directed’’ Christiano 2014). He shows that the decision theory implicit in the decisions of such an agent is determined by how the agent and overseer compute the expected values of actions. In this vein: What decision-making procedures are implicit in ML agents trained according to different protocols? See for instance Krueger et al. (2019)’s discussion of “hidden incentives for inducing distributional shift” associated with certain population-based training methods (Jaderberg et al., 2017) for reinforcement learning; cf. Everitt et al. (2019) on understanding agent incentives with causal influence diagrams.
  • A "model-free’’ agent is one which implicitly learns the expected values of its actions by observing the streams of rewards that they generate; such agents are the focus of most deep reinforcement learning research. By contrast, a "model-based’’ agent (Sutton and Barto, 2018, Ch. 8) is one which explicitly models the world and computes the expected values of its actions by simulating their effects on the world using this model. In certain model-based agents, an agent’s decision theory can be specified directly by the modeler, rather than arising implicity [5]. Do any decision-theoretic settings specified by the modeler robustly lead to cooperative outcomes across a wide range of multi-agent environments? Or are outcomes highly sensitive to the details of the situation?

6 Humans in the loop [6]

TAI agents may acquire their objectives via interaction with or observation of humans. Relatedly, TAI systems may consist of AI-assisted humans, as in Drexler (2019)’s comprehensive AI services scenario. Relevant AI techniques include:

  • Approval-directedness, in which an agent attempts to maximize human-assigned approval scores (Akrour et al., 2011; Christiano, 2014);
  • Imitation (Schaal, 1999; Ross et al., 2011; Evans et al., 2018), in which an agent attempts to imitate the behavior of a demonstrator;
  • Preference inference (Ng et al., 2000; Hadfield-Menell et al., 2016; Christiano et al., 2017; Leike et al., 2018), in which an agent attempts to learn the reward function implicit in the behavior of a demonstrator and maximize this estimated reward function.

In human-in-the-loop scenarios, human responses will determine the outcomes of opportunities for cooperation and conflict.

6.1 Behavioral game theory

Behavioral game theory has often found deviations from theoretical solution concepts among human game-players. For instance, people tend to reject unfair splits in the ultimatum game despite this move being ruled out by subgame perfection (Section 3). In the realm of bargaining, human subjects often reach different bargaining solutions than those standardly argued for in the game theory literature (in particular, the Nash (Nash, 1950) and Kalai-Smorodinsky (Kalai et al., 1975) solutions) (Felsenthal and Diskin, 1982; Schellenberg, 1988). Thus the behavioral game theory of human-AI interaction in critical scenarios may be a vital complement to theoretical analysis when designing human-in-the-loop systems.

  • Under what circumstances do humans interacting with an artificial agent become convinced that the agent’s commitments are credible (Section 3)? How do humans behave when they believe their AI counterpart’s commitments are credible or not? Are the literatures on trust and artificial agents (e.g., Grodzinsky et al. 2011; Coeckelbergh 2012) and automation bias (Mosier et al., 1998; Skitka et al., 1999; Parasuraman and Manzey, 2010) helpful here? (See also Crandall et al. (2018), who develop an algorithm for promoting cooperation between humans and machines.)
  • In sequential games with repeated opportunities to commit via a credible commitment device, how quickly do players make such commitments? How do other players react? Given the opportunity to commit to bargaining rather than to simply carry out a threat if their demands aren't met (see Example 4.1.1), what do players do? Cf. experimental evidence regarding commitment and crisis bargaining; e.g., Quek (2017) finds that human subjects go to war much more frequently in a war game when commitments are not enforceable.
  • Sensitivity to stakes varies over behavioral decision- and game-theoretic contexts (e.g., Kahneman et al. 1999; Dufwenberg and Gneezy 2000; Schmidt et al. 2001; Andersen et al. 2011). How sensitive to stakes are the behaviors in which we are most interested? (This is relevant as we’re particularly concerned with catastrophic failures of cooperation.)
  • How do humans model the reasoning of intelligent computers, and what are the implications for limiting downsides in interactions involving humans? For instance, in experiments on games, humans tend to model their counterparts as reasoning at a lower depth than they do (Camerer et al., 2004) [7]. But this may not be the case when humans instead face computers they believe to be highly intelligent.
  • How might human attitudes towards the credibility of artificial agents change over time — for instance, as a result of increased familiarity with intelligent machines in day-to-day interactions? What are the implications of possible changes in attitudes for behavioral evidence collected now?
  • We are also interested in extensions of existing experimental paradigms in behavioral game theory to interactions between humans and AIs, especially research on costly failures such as threats (Bolle et al., 2011; Andrighetto et al., 2015).

6.2 AI delegates

In one class of TAI trajectories, humans control powerful AI delegates who act on their behalf (gathering resources, ensuring safety, etc.). One model for powerful AI delegates is Christiano (2016a)’s (recursively titled) "Humans consulting HCH’’ (HCH). Saunders (2019) explains HCH as follows:

HCH, introduced in Humans consulting HCH (Christiano, 2016a), is a computational model in which a human answers questions using questions answered by another human, which can call other humans, which can call other humans, and so on. Each step in the process consists of a human taking in a question, optionally asking one or more sub-questions to other humans, and returning an answer based on those subquestions. HCH can be used as a model for what Iterated Amplification [8] would be able to do in the limit of infinite compute.
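To make the recursive structure concrete, here is a minimal sketch of HCH as a computation. The `Human` interface (`decompose`, `answer`) and the finite depth cutoff are our own illustrative assumptions; the idealized model has no fixed depth limit:

```python
# Minimal sketch of HCH as a recursive computation over a hypothetical Human
# interface with two methods: decompose(question) -> list of sub-questions,
# and answer(question, sub_answers) -> answer.

def hch(question, human, depth):
    if depth == 0:
        # At the base of the recursion, the human answers without delegating further.
        return human.answer(question, sub_answers=[])
    # The human optionally breaks the question into sub-questions...
    sub_questions = human.decompose(question)
    # ...each sub-question is answered by another (simulated) human consulting HCH...
    sub_answers = [hch(q, human, depth - 1) for q in sub_questions]
    # ...and the answers are composed into an answer to the original question.
    return human.answer(question, sub_answers)
```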

A particularly concerning class of cooperation failures in such scenarios are threats by AIs or AI-assisted humans against one another.

  • Threats could target 1) the delegate’s objectives (e.g., destroying the system’s resources or its ability to keep its overseer alive and comfortable), or 2) the human overseer’s terminal values. Threats of the second type might be much worse. It seems important to investigate the incentives for would-be threateners to use one type of threat or the other, in the hopes of steering dynamics towards lower-stakes threats.
  • We are also interested in how interactions between humans and AI delegates could be limited so as to minimize threat risks.

Saunders also discusses a hypothetical manual for overseers in the HCH scheme. In this manual, overseers could find advice "on how to corrigibly answer questions by decomposing them into sub-questions.’’ Exploring practical advice that could be included in this manual might be a fruitful exercise for identifying concrete interventions for addressing cooperation failures in HCH and other human-in-the-loop settings. Examples include:

  • Instructions related to rational crisis bargaining (Section 4.1);
  • Instructions related to the implementation of surrogate goals (Section 4.2).

Acknowledgements & References


  1. Cf. Christiano (2016b)'s discussion of "prosaic’’ artificial general intelligence, defined as that "which doesn’t reveal any fundamentally new ideas about the nature of intelligence or turn up any 'unknown unknowns' ". ↩︎

  2. Although Foerster et al. (2018) develop a version of LOLA with "opponent modeling'' where an agent only makes inferences about their counterpart's parameters, rather than actually seeing them. Zhang and Lesser (2010) present a similar method, though unlike LOLA theirs does not attempt to shape the counterpart's update. ↩︎

  3. Rabin (1993); Fehr and Schmidt (1999); Bolton and Ockenfels (2000) study fairness and trust; Camerer and Hua Ho (1999) develop a large class of models for explaining human learning in games; and Camerer (2008, Ch. 4) reviews the behavioral literature on bargaining, concluding that a satisfactory theory of bargaining would "probably weave together perceptions of equity…, stable social preferences for equal payoffs or fair treatment, heuristic computation, and individual differences...’’. Also see the discussion of behavioral game theory and human evolution by Hagen and Hammerstein (2006) and references therein. ↩︎

  4. See also Camerer and Hua Ho (1999)’s distinction between "the law of actual effect’’ and "the law of simulated effect’’. ↩︎

  5. For example, Everitt et al. (2015) develop sequential extensions of the most commonly studied decision theories, causal and evidential decision theory, in a general reinforcement learning framework. One could develop similar extensions for model-based multi-agent frameworks, like Gmytrasiewicz and Doshi (2005)’s interactive partially observable Markov decision processes. ↩︎

  6. Notes by Lukas Gloor contributed substantially to the content of this section. ↩︎

  7. This has been illustrated in the $p$-Beauty Contest Game (BCG). In the BCG, multiple players simultaneously say a number between 0 and 100. The winner is the person whose number is closest to the mean of all the numbers, times a commonly known number $p \in (0, 1)$. If there is a tie, the payoff is divided evenly. This game has a single Nash equilibrium: everyone says 0. However, human players typically don’t play this way. Instead, experimental evidence suggests that players model others as reasoning fewer steps ahead than they do (“If they know I choose X then they will choose Y, so then I will choose Z instead…”), and then choose the best response to these predicted moves (Nagel, 1995). ↩︎

  8. Iterated (Distillation and) Amplification (IDA) is Christiano (2018b)’s proposal for training aligned AI systems. In brief, it consists of iterating a Distillation step in which the capabilities of a team of AI delegates are distilled into a single agent; and an Amplification step, in which the capabilities of the distilled agent are amplified by copying that agent many times and delegating different tasks to different copies. The hope for IDA as an approach to AI safety is that many slightly less-capable agents will be able to control the more powerful agent produced by the latest Distill step, at each iteration of the process. See Cotra (2018) for an accessible overview of IDA. ↩︎

Comments

It seems like replacing two agents A and B by a single agent that optimizes for their welfare function would avoid the issue of punishment. I guess that doing this might be feasible in some cases for artificial agents (as a single agent optimizing for the welfare function is a simpler object than the two-agent dynamics including punishment) and potentially understudied, as the solution seems harder to implement for humans (even though human solutions to collective action problems at least resemble the approach). One key problem might be finding a welfare function that both agents agree on, especially if there is information asymmetry.

Any thoughts on this?

Edit: The approach seems to be most trivial when both agents share their world model and optimize for explicit utilities over this world model. More generally, two principals with similar amounts of compute and similarly easily optimizable utility functions are most likely better off building an agent that optimizes for their welfare instead of two agents that need to learn to compete and cooperate. Optimizing for the welfare function applied to the agents' value functions can be done by a somewhat straightforward modification of Q-learning or (in the case of differentiable welfare) policy gradient methods.

I definitely think it's worth exploring. I have the intuition that creating a single agent might be difficult for various logistical and political reasons, and so it feels more robust to figure out the multiagent case. But I would certainly like to have a clearer picture of how and under what circumstances several AI developers might implement a single compromise agent.

Inconsequential heads up: At least on my screen, it seemed there were symbols missing at the ends of each of the following two sentences:

Define the cooperative policies as .

And:

An extreme case of a punishment policy is the one in which an agent commits to minimizing their counterpart's utility once they have defected: .

Fixed, thanks :)