It seems like replacing two agents A and B with a single agent that optimizes their welfare function would avoid the issue of punishment. I guess that doing this might be feasible in some cases for artificial agents (a single agent optimizing the welfare function is a simpler object than the two-agent dynamics, punishment included) and is potentially understudied, as the solution seems harder to implement for humans (even though human solutions to collective action problems at least resemble this approach). One key problem might be finding a welfare function that both agents agree on, especially if there is information asymmetry.
Any thoughts on this?
Edit: The approach seems most straightforward when both agents share their world model and optimize explicit utilities over that world model. More generally, two principals with similar amounts of compute and similarly easy-to-optimize utility functions are most likely better off building a single agent that optimizes their welfare function than two agents that need to learn to compete and cooperate. Optimizing the welfare function applied to the agents' value functions can be done with a fairly straightforward modification of Q-learning or (in the case of a differentiable welfare function) policy gradient methods.
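To gesture at what I mean, here is a minimal tabular sketch (the environment interface and all names are made up for illustration): a single agent keeps one Q-table per principal, updates each table with that principal's reward, and acts greedily with respect to the welfare of the per-principal Q-value estimates.

```python
import numpy as np

def welfare_q_learning(env, welfare, n_episodes=1000, alpha=0.1,
                       gamma=0.99, epsilon=0.1):
    """Tabular sketch: one Q-table per principal; actions are chosen
    greedily w.r.t. the welfare of the per-principal Q-values.

    Assumes a hypothetical `env` with reset()/step(a) returning a state
    index, a pair of rewards (r1, r2), and a done flag, plus n_states and
    n_actions attributes. `welfare` maps the two Q-value arrays to a
    scalar per action, e.g. lambda q1, q2: q1 + q2 (utilitarian) or
    np.minimum(q1, q2) (egalitarian).
    """
    Q = np.zeros((2, env.n_states, env.n_actions))  # one table per principal
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if np.random.rand() < epsilon:
                a = np.random.randint(env.n_actions)
            else:
                a = int(np.argmax(welfare(Q[0, s], Q[1, s])))
            s_next, (r1, r2), done = env.step(a)
            # The bootstrap action is the welfare-greedy action at s_next,
            # but each Q-table is updated with its own principal's reward.
            a_next = int(np.argmax(welfare(Q[0, s_next], Q[1, s_next])))
            for i, r in enumerate((r1, r2)):
                target = r + gamma * (0.0 if done else Q[i, s_next, a_next])
                Q[i, s, a] += alpha * (target - Q[i, s, a])
            s = s_next
    return Q
```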
I definitely think it's worth exploring. I have the intuition that creating a single agent might be difficult for various logistical and political reasons, and so it feels more robust to figure out the multiagent case. But I would certainly like to have a clearer picture of how and under what circumstances several AI developers might implement a single compromise agent.
Inconsequential heads up: At least on my screen, it seemed there were symbols missing at the ends of each of the following two sentences:
Define the cooperative policies as .
And:
An extreme case of a punishment policy is the one in which an agent commits to minimizing their counterpart's utility once they have defected: .
This post is part of the sequence version of the Effective Altruism Foundation's research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence.
5 Contemporary AI architectures
Although the architectures of TAI systems will likely be quite different from existing ones, it may still be possible to gain some understanding of cooperation failures among such systems using contemporary tools [1]. First, it is plausible that some aspects of contemporary deep learning methods will persist in TAI systems, making experiments done today directly relevant. Second, even if this is not the case, such research may still help by laying the groundwork for the study of cooperation failures in more advanced systems.
5.1 Learning to solve social dilemmas
As mentioned above, some attention has recently been devoted to social dilemmas among deep reinforcement learners (Leibo et al., 2017; Peysakhovich and Lerer, 2017; Lerer and Peysakhovich, 2017; Foerster et al., 2018; Wang et al., 2018). However, a fully general, scalable, and theoretically principled approach to achieving cooperation among deep reinforcement learning agents is still lacking. In Example 5.1.1 we sketch a general approach to cooperation in general-sum games which subsumes several recent methods, and afterwards list some research questions raised by the framework.
Example 5.1.1 (Sketch of a framework for cooperation in social dilemmas).
The setting is a 2-agent decision process. At each timestep $t$, each agent $i$ receives an observation $o^t_i$; takes an action $a^t_i = \pi_i(o^t_i)$ based on their policy $\pi_i$ (assumed to be deterministic for simplicity); and receives a reward $r^t_i$. Player $i$ expects to get a value of $V_i(\pi_1, \pi_2)$ if the policies $\pi_1, \pi_2$ are deployed. Examples of such environments which are amenable to study with contemporary machine learning tools are the "sequential social dilemmas" introduced by Leibo et al. (2017). These include a game involving potential conflict over scarce resources, as well as a coordination game similar in spirit to Stag Hunt (Table 1).
Suppose that the agents (or their overseers) have the opportunity to choose what policies to deploy by simulating from a model, and to bargain over the choice of policies. The idea is for the parties to arrive at a welfare function $w$ which they agree to jointly maximize; deviations from the policies which maximize the welfare function will be punished if detected. Let $V^d_i$ be a "disagreement point" measuring how well agent $i$ expects to do if they deviate from the welfare-maximizing policy profile. This could be their security value $\max_{\pi_1} \min_{\pi_2} V_i(\pi_1, \pi_2)$ or an estimate of their value when the agents use independent learning algorithms. Finally, define player $i$'s ideal point $V^*_i = \max_{\pi_1, \pi_2} V_i(\pi_1, \pi_2)$. Table 5 displays welfare functions corresponding to several widely-discussed bargaining solutions, adapted to the multi-agent reinforcement learning setting.
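Table 5 itself is not reproduced in this version of the post. For a rough sense of the kinds of welfare functions involved (a sketch in the notation above, not the table's contents), the utilitarian, Nash, and Kalai-Smorodinsky solutions correspond to maximizing, respectively,

$$w_{\text{util}}(\pi_1, \pi_2) = \sum_i V_i(\pi_1, \pi_2), \qquad w_{\text{Nash}}(\pi_1, \pi_2) = \prod_i \bigl( V_i(\pi_1, \pi_2) - V^d_i \bigr),$$

$$w_{\text{KS}}(\pi_1, \pi_2) = \min_i \frac{V_i(\pi_1, \pi_2) - V^d_i}{V^*_i - V^d_i},$$

each measuring the gains from cooperation relative to the disagreement points (and, for Kalai-Smorodinsky, the ideal points) in a different way.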
Define the cooperative policies as $\pi^C_1, \pi^C_2 = \arg\max_{\pi_1, \pi_2} w(\pi_1, \pi_2)$. We need a way of detecting defections so that we can switch from the cooperative policy $\pi^C_1$ to a punishment policy. Call a function that detects defections a "switching rule". To make the framework general, consider switching rules $\chi$ which return 1 for Switch and 0 for Stay. Rules $\chi$ depend on the agent's observation history $H^t_i$. The contents of $H^t_i$ will differ based on the degree of observability of the environment, as well as how transparent agents are to each other (cf. Table 6). Example switching rules include:
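As one concrete illustration (a minimal sketch under strong assumptions, not the examples from the original agenda): with full observability of the counterpart's actions, a switching rule can simply compare the counterpart's observed actions against what the agreed cooperative policy would have played.

```python
def exact_match_switch_rule(history, coop_policy):
    """Sketch of a switching rule under full observability: switch as soon
    as the counterpart's observed action differs from the agreed
    cooperative policy.

    `history` is assumed to be a list of (counterpart_observation,
    counterpart_action) pairs; under partial observability a rule would
    instead have to test whether observed behavior is statistically
    consistent with cooperation.
    """
    for obs, action in history:
        if action != coop_policy(obs):
            return 1  # Switch
    return 0  # Stay
```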
Finally, the agents need punishment policies $\pi^D_i$ to switch to in order to disincentivize defections. An extreme case of a punishment policy is the one in which an agent commits to minimizing their counterpart's utility once they have defected: $\pi^{D,\mathrm{minimax}}_1 = \arg\min_{\pi_1} \max_{\pi_2} V_2(\pi_1, \pi_2)$. This is the generalization of the so-called "grim trigger" strategy underlying the classical theory of iterated games (Friedman, 1971; Axelrod, 2000). It can be seen that each player submitting a grim trigger strategy in the above framework constitutes a Nash equilibrium in the case that the counterpart's observations and actions are visible (and therefore defections can be detected with certainty). However, grim trigger is intuitively an extremely dangerous strategy for promoting cooperation, and indeed does poorly in empirical studies of different strategies for the iterated Prisoner's Dilemma (Axelrod and Hamilton, 1981). One possibility is to train more forgiving, tit-for-tat-like punishment policies, and play a mixed strategy when choosing which to deploy in order to reduce exploitability.
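Putting the pieces together, a minimal sketch of the resulting meta-policy might look as follows. The `forgiveness` parameter is an illustrative addition, not part of the framework above; it interpolates between grim trigger (never forgive) and more forgiving, tit-for-tat-like behavior.

```python
import random

class TriggerAgent:
    """Sketch: play the cooperative policy until the switching rule fires,
    then play the punishment policy; with per-step probability
    `forgiveness`, return to cooperation."""

    def __init__(self, coop_policy, punish_policy, switch_rule, forgiveness=0.0):
        self.coop_policy = coop_policy
        self.punish_policy = punish_policy
        self.switch_rule = switch_rule
        self.forgiveness = forgiveness
        self.punishing = False
        self.history = []  # whatever the switching rule needs to observe

    def act(self, observation, counterpart_info=None):
        if counterpart_info is not None:
            self.history.append(counterpart_info)
        if not self.punishing and self.switch_rule(self.history) == 1:
            self.punishing = True
        elif self.punishing and random.random() < self.forgiveness:
            self.punishing = False  # forgive and return to cooperation
        policy = self.punish_policy if self.punishing else self.coop_policy
        return policy(observation)
```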
Some questions facing a framework for solving social dilemmas among deep reinforcement learners, such as that sketched in Example 5.1.1, include:
In addition to the theoretical development of open-source game theory (Section 3.2), interactions between transparent agents can be studied using tools like deep reinforcement learning. Learning equilibrium (Brafman and Tennenholtz, 2003) and learning with opponent-learning awareness (LOLA) (Foerster et al., 2018; Baumann et al., 2018; Letcher et al., 2018) are examples of analyses of learning under transparency.
5.2 Multi-agent training
Multi-agent training is an emerging paradigm for the training of generally intelligent agents (Lanctot et al., 2017; Rabinowitz et al., 2018; Suarez et al., 2019; Leibo et al., 2019). It is as yet unclear what the consequences of such a learning paradigm are for the prospects of cooperation among advanced AI systems.
5.3 Decision theory
Understanding the decision-making procedures implemented by different machine learning algorithms may be critical for assessing how they will behave in high-stakes interactions with humans or other AI agents. One potentially relevant factor is the decision theory implicitly implemented by a machine learning agent. We discuss decision theory at greater length in Section 7.2, but briefly: by an agent's decision theory, we roughly mean which dependencies the agent accounts for when predicting the outcomes of its actions. While it is standard to consider only the causal effects of one's actions ("causal decision theory" (CDT)), there are reasons to think agents should account for non-causal evidence that their actions provide about the world [4]. And different ways of computing the expected effects of actions may lead to starkly different behavior in multi-agent settings.
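As a standard toy illustration of how these procedures can diverge (this example is not from the original agenda): suppose an agent plays a one-shot Prisoner's Dilemma against an exact copy of itself, with payoffs $u(C,C)=3$, $u(D,D)=1$, $u(D,C)=4$, $u(C,D)=0$ for the row player. Treating its own action as (near-perfect) evidence about its copy's action, the agent compares

$$\mathbb{E}[u \mid C] = u(C,C) = 3 \quad \text{vs.} \quad \mathbb{E}[u \mid D] = u(D,D) = 1$$

and cooperates, whereas a purely causal reasoner holds the copy's action fixed, notes that $u(D, a_2) > u(C, a_2)$ for either fixed $a_2$, and defects.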
6 Humans in the loop [6]
TAI agents may acquire their objectives via interaction with or observation of humans. Relatedly, TAI systems may consist of AI-assisted humans, as in Drexler (2019)’s comprehensive AI services scenario. Relevant AI techniques include:
In human-in-the-loop scenarios, human responses will determine the outcomes of opportunities for cooperation and conflict.
6.1 Behavioral game theory
Behavioral game theory has often found deviations from theoretical solution concepts among human game-players. For instance, people tend to reject unfair splits in the ultimatum game despite this move being ruled out by subgame perfection (Section 3). In the realm of bargaining, human subjects often reach different bargaining solutions than those standardly argued for in the game theory literature (in particular, the Nash (Nash, 1950) and Kalai-Smorodinsky (Kalai et al., 1975) solutions) (Felsenthal and Diskin, 1982; Schellenberg, 1988). Thus the behavioral game theory of human-AI interaction in critical scenarios may be a vital complement to theoretical analysis when designing human-in-the-loop systems.
6.2 AI delegates
In one class of TAI trajectories, humans control powerful AI delegates who act on their behalf (gathering resources, ensuring safety, etc.). One model for powerful AI delegates is Christiano (2016a)’s (recursively titled) "Humans consulting HCH’’ (HCH). Saunders (2019) explains HCH as follows:
A particularly concerning class of cooperation failures in such scenarios are threats by AIs or AI-assisted humans against one another.
Saunders also discusses a hypothetical manual for overseers in the HCH scheme. In this manual, overseers could find advice "on how to corrigibly answer questions by decomposing them into sub-questions.’’ Exploring practical advice that could be included in this manual might be a fruitful exercise for identifying concrete interventions for addressing cooperation failures in HCH and other human-in-the-loop settings. Examples include:
Acknowledgements & References
Cf. Christiano (2016b)'s discussion of "prosaic’’ artificial general intelligence, defined as that "which doesn’t reveal any fundamentally new ideas about the nature of intelligence or turn up any 'unknown unknowns' ". ↩︎
Although Foerster et al. (2018) develop a version of LOLA with "opponent modeling", where an agent only makes inferences about their counterpart's parameters rather than actually seeing them. Zhang and Lesser (2010) present a similar method, though unlike LOLA theirs does not attempt to shape the counterpart's update. ↩︎
Rabin (1993); Fehr and Schmidt (1999); Bolton and Ockenfels (2000) study fairness and trust; Camerer and Hua Ho (1999) develop a large class of models for explaining human learning in games; and Camerer (2008, Ch. 4) reviews the behavioral literature on bargaining, concluding that a satisfactory theory of bargaining would "probably weave together perceptions of equity…, stable social preferences for equal payoffs or fair treatment, heuristic computation, and individual differences...". Also see the discussion of behavioral game theory and human evolution by Hagen and Hammerstein (2006) and references therein. ↩︎
See also Camerer and Hua Ho (1999)’s distinction between "the law of actual effect’’ and "the law of simulated effect’’. ↩︎
For example, Everitt et al. (2015) develop sequential extensions of the most commonly studied decision theories, causal and evidential decision theory, in a general reinforcement learning framework. One could develop similar extensions for model-based multi-agent frameworks, like Gmytrasiewicz and Doshi (2005)'s interactive partially observable Markov decision processes. ↩︎
Notes by Lukas Gloor contributed substantially to the content of this section. ↩︎
This has been illustrated in the p-Beauty Contest Game (BCG). In the BCG, multiple players simultaneously say a number between 0 and 100. The winner is the person whose number is closest to the mean of all the numbers multiplied by a commonly known number $p \in (0,1)$. If there is a tie, the payoff is divided evenly. This game has a single Nash equilibrium: everyone says 0. However, human players typically don't play this way. Instead, experimental evidence suggests that players model others as reasoning fewer steps ahead than they do ("If they know I choose X then they will choose Y, so I will choose Z instead…"), and then choose the best response to these predicted moves (Nagel, 1995); a minimal sketch of this level-k pattern appears after these notes. ↩︎
Iterated (Distillation and) Amplification (IDA) is Christiano (2018b)'s proposal for training aligned AI systems. In brief, it consists of iterating a Distillation step, in which the capabilities of a team of AI delegates are distilled into a single agent, and an Amplification step, in which the capabilities of the distilled agent are amplified by copying that agent many times and delegating different tasks to different copies. The hope for IDA as an approach to AI safety is that, at each iteration of the process, many slightly less-capable agents will be able to control the more powerful agent produced by the latest Distillation step. See Cotra (2018) for an accessible overview of IDA. ↩︎
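The level-k pattern described in the beauty-contest footnote above can be made concrete with a minimal sketch (the level-0 guess of 50 is one common modelling assumption, not something stated in the agenda): a level-0 player guesses the midpoint, and each level-k player best-responds to level-(k-1) opponents by guessing $p$ times their guess, ignoring the player's own small influence on the mean.

```python
def level_k_guesses(p, levels=5, level0_guess=50.0):
    """Illustrative level-k predictions for the p-Beauty Contest Game."""
    guesses = [level0_guess]
    for _ in range(levels):
        # Best response to opponents who all play the previous level's guess.
        guesses.append(p * guesses[-1])
    return guesses

# Example: with p = 2/3, the guesses shrink toward the Nash equilibrium of 0:
# [50.0, 33.3, 22.2, 14.8, 9.9, 6.6]
print([round(g, 1) for g in level_k_guesses(2 / 3)])
```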