Introduction
We at the Center on Long-Term Risk (CLR) are focused on reducing risks of cosmically significant amounts of suffering[1], or s-risks, from transformative artificial intelligence (TAI). We currently believe that:
This was the subject of our research agenda on cooperation, conflict, and TAI. In this post, I’d like to give an overview of some of the research that CLR has been doing on multi-agent systems. I’ll then briefly remark on the importance, tractability, and neglectedness of this work (both from a downside-focused and a more mainstream longtermist perspective), though a thorough discussion of prioritization is beyond the scope of this post. The main goal is to inform the community about what we’ve been up to recently in this space.
We’re interested in supporting people who want to contribute to this work. If you might benefit from financial support to work on topics related to the research described here, you can fill out this form. If you’re interested in working with us as a temporary summer research fellow, full-time researcher, or research assistant, apply here.
Potential causes of conflict between intelligent actors
It is possible that TAI systems will find themselves in conflict, despite the fact that conflict is often seemingly Pareto-inefficient. Factors that might lead intelligent agents to become engaged in conflict include:
Just as humans have failed to avoid catastrophic conflict throughout history, factors 1-4 could lead to catastrophic conflict between agents who are aligned to their respective human principals. Factor 5 could manifest as an alignment failure.
I think these factors make catastrophic conflict between AI systems more likely than many longtermists appreciate, because they demonstrate how intelligence may not be sufficient to avoid conflict. Nevertheless, some very smart people disagree, and thoroughly making this argument is beyond the scope of this post. Throughout the post I will just assume that the chances of catastrophic conflict involving AI systems are high enough to warrant the concern of longtermists.
CLR’s work on multi-agent systems is aimed at intervening in the design of TAI systems to make them less subject to these conflict-conducive factors, especially insofar as these factors could lead to great suffering.
CLR’s recent work
Incompatible commitments and beliefs
Much of our recent thinking on multi-agent systems has to do with factors 4 (incompatible commitments) and 1 (uncertainty about resolve) listed above. Agents are, on the face of it, less likely to reach a mutually acceptable agreement if they don’t have the same model of their strategic situation and equilibrium selection procedure (which are cases of “beliefs” and “commitments”, respectively). Because this may not hold by default, we want to understand how to make it more likely that agents have sufficiently compatible beliefs and commitments to avoid catastrophic bargaining failure. The commitment races problem also points to how incompatible commitments could be made and lead to disaster.
In another post, I wrote about the “equilibrium and prior selection problems”, which are instances of incompatible commitments and incompatible beliefs. These refer to cases in which agents have
different ways of selecting among equilibria (corresponding to different criteria for deciding whether an agreement is acceptable) and different priors over each others’ private information, leading to disagreements over (say) the credibility of a threat and in turn the execution of that threat.
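The equilibrium selection problem can be made concrete with a toy example (my own illustration; the game, criteria names, and tie-breaking rules below are hypothetical). In Bach or Stravinsky, both pure equilibria yield the same total welfare, so agents whose selection criteria break the tie differently each play their part of a different equilibrium and miscoordinate:

```python
import numpy as np

# Bach or Stravinsky payoff matrices (row player, column player).
# Two pure-strategy equilibria: (Bach, Bach) -> (2, 1) and
# (Stravinsky, Stravinsky) -> (1, 2).
ROW = np.array([[2.0, 0.0],
                [0.0, 1.0]])
COL = np.array([[1.0, 0.0],
                [0.0, 2.0]])

EQUILIBRIA = [(0, 0), (1, 1)]  # (Bach, Bach), (Stravinsky, Stravinsky)

def select(criterion):
    """Pick an equilibrium by total welfare, with a player-favoring tie-break."""
    if criterion == "utilitarian_favor_row":
        return max(EQUILIBRIA, key=lambda e: (ROW[e] + COL[e], ROW[e]))
    if criterion == "utilitarian_favor_col":
        return max(EQUILIBRIA, key=lambda e: (ROW[e] + COL[e], COL[e]))
    raise ValueError(criterion)

# Each agent independently applies its own criterion and plays its part.
row_action, _ = select("utilitarian_favor_row")
_, col_action = select("utilitarian_favor_col")
payoffs = (ROW[row_action, col_action], COL[row_action, col_action])
print(payoffs)  # (0.0, 0.0): Pareto-dominated by either equilibrium
```

Both criteria are individually reasonable, but applied by different agents they produce a joint outcome worse for everyone than either equilibrium.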
One reason to think that highly intelligent agents may have beliefs about each other that diverge to a significant enough extent to cause catastrophic failures of cooperation is that models of other agents may be badly underdetermined by data. This is especially true in strategic settings (where there are incentives to mask your private information).
A framework for (partially) addressing these problems is collaborative game specification (CGS). CGS is a family of schemes by which agents can come to agree on a common model of their strategic situation, and use this to reach an agreement. There are two major benefits of CGS. First, agents who can agree upon an explicit game-theoretic model can apply formal solution concepts to these models to reach agreements (as opposed to just communicating in natural language). This in turn makes it easier to coordinate on standards for arriving at agreements. Second, it helps agents overcome potentially catastrophic differences in their perceptions (for instance, of the credibility of a threat), including by allowing agents to agree on priors over private information.
However, there are several possible downsides and limitations to using CGS, so more strategic and technical work is needed to understand its value as an intervention; see Section 3 of the linked post for some discussion. Also note that CGS does not solve the problem that agents may have incompatible commitments in bargaining. But it does provide a relatively clear target for coordination in order to avoid these incompatibilities. Future work on this will include both investigating the pros and cons of using something like CGS and technical work on how it might be implemented.
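As a minimal sketch of the CGS idea (my own toy construction, not an implementation from the linked post): two agents with noisy private models of the same underlying game merge them into one explicit shared game, then both apply the same formal solution concept (here the Nash bargaining solution over pure joint actions, with maximin disagreement points) to that shared model.

```python
import numpy as np
from itertools import product

def nash_bargaining(u1, u2):
    """Nash bargaining solution over pure joint actions of a bimatrix game,
    using each player's maximin (security) value as the disagreement point."""
    d1 = u1.min(axis=1).max()   # row player's security level
    d2 = u2.min(axis=0).max()   # column player's security level
    candidates = [(a, b)
                  for a, b in product(range(u1.shape[0]), range(u1.shape[1]))
                  if u1[a, b] >= d1 and u2[a, b] >= d2]
    return max(candidates, key=lambda ab: (u1[ab] - d1) * (u2[ab] - d2))

# Each agent's private model of the same underlying game (noisy copies).
rng = np.random.default_rng(0)
base1 = np.array([[5.0, 1.0], [0.0, 2.0]])   # payoffs to player 1
base2 = np.array([[3.0, 0.0], [1.0, 4.0]])   # payoffs to player 2
model_a = (base1 + rng.normal(0, 0.1, (2, 2)), base2 + rng.normal(0, 0.1, (2, 2)))
model_b = (base1 + rng.normal(0, 0.1, (2, 2)), base2 + rng.normal(0, 0.1, (2, 2)))

# CGS step 1: merge private models into one explicit, agreed-upon game...
shared_u1 = (model_a[0] + model_b[0]) / 2
shared_u2 = (model_a[1] + model_b[1]) / 2
# ...step 2: both agents apply the same formal solution concept to it.
agreement = nash_bargaining(shared_u1, shared_u2)
print(agreement)  # both agents arrive at the same joint action
```

The point is not the particular merging rule (averaging is a placeholder), but that once a common explicit model exists, a standard solution concept gives both agents the same answer.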
Multi-agent learning
We have also been trying to place some of our concerns in the context of multi-agent learning, and to build infrastructure that will help us and others do more technical work that’s relevant to cooperation and conflict in the context of TAI.
In Stastny et al., we discuss mixed-motive coordination problems (MCPs), which are environments that naturally give rise to incompatible commitments[4]. While multi-agent reinforcement learning (MARL) has historically placed a lot of emphasis on purely competitive (zero-sum) games or purely cooperative games, most real-world interactions are mixed-motive. There has been a fair amount of work in recent years on sequential social dilemmas (SSDs) (of which the iterated Prisoner’s Dilemma is an example), which are mixed-motive games. However, these environments fail to capture some essential hurdles for cooperation: while the SSDs studied so far have a single compelling “cooperative” outcome, real-world problems have many outcomes that might count as “cooperative”, and the agents involved may disagree about which of them to aim for.
We point out that several MARL algorithms that achieve cooperation in SSDs nevertheless fail in MCPs, in the sense that policies produced by independent training by these algorithms (mimicking agents trained and deployed by different AI developers) often result in Pareto-suboptimal payoffs. Figure 1 illustrates this in the case of Bach or Stravinsky, where (unsurprisingly) the LOLA algorithm often leads to Pareto-dominated outcomes between independently trained policies, while this is not the case for the iterated Prisoner’s Dilemma. We also present a partial solution, in which AI systems train policies corresponding to different bargaining solutions and carry out a meta-bargaining protocol to avoid miscoordination.
Figure 1. Comparison of LOLA in self-play (policies from the same training run) and cross-play (policies from different training runs), in the iterated Bach or Stravinsky game (an MCP) and the iterated Prisoner’s Dilemma (not an MCP). Shaded regions are the attainable payoff profiles in the respective games. Points are jittered.
In addition to illustrating an important and neglected class of environments, our MCP working paper is intended to help shape multi-agent ML research by arguing for the importance of two criteria for the evaluation of multi-agent learning algorithms. The (generalized) cross-play score of an algorithm is roughly the average performance of policies produced by independent runs of that algorithm (the “generalized” part accounts for the fact that algorithms may take as input different welfare functions, leading to further risk of miscoordination).
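The gap between self-play and cross-play evaluation can be illustrated with a toy simulation (mine, not the paper's experiments). Suppose each independent "training run" in Bach or Stravinsky converges to one of the game's two pure equilibria at random; self-play pairs a run's policies with each other, while cross-play pairs policies from different runs:

```python
import numpy as np

# Toy model: each training run converges to one of Bach or Stravinsky's
# two pure equilibria, chosen at random per run.
ROW = np.array([[2.0, 0.0], [0.0, 1.0]])   # row player's payoffs
COL = np.array([[1.0, 0.0], [0.0, 2.0]])   # column player's payoffs

rng = np.random.default_rng(1)
runs = rng.integers(0, 2, size=200)        # 0 = (Bach, Bach), 1 = (Strav, Strav)

def joint_payoff(i, j):
    """Total welfare when run i's row policy meets run j's column policy."""
    return ROW[runs[i], runs[j]] + COL[runs[i], runs[j]]

n = len(runs)
self_play = np.mean([joint_payoff(i, i) for i in range(n)])
cross_play = np.mean([joint_payoff(i, j) for i in range(n) for j in range(n) if i != j])
print(self_play, cross_play)   # self-play is efficient; cross-play is much worse
```

Every run coordinates perfectly with itself, so self-play looks flawless; cross-play reveals the miscoordination between independently trained policies, which is what the (generalized) cross-play score is designed to measure.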
In contrast to the work above, this working paper on cooperation in “learning games” by Maxime Riché and myself applies to cases in which agents learn their world-models online. One direction for promoting cooperation in this setting is to have principals coordinate on a welfare function to optimize, and on a set of algorithms for optimizing that welfare function (including updating world-models). The idea is to make sure agents are always on the same page about the current best guess at the welfare-optimizing policy profile. Incentives to deviate from this agreement must be handled; this is done analogously to how cooperation is enforced in typical repeated games: actors agree on a profile of learning algorithms to deploy, and punish their counterpart if they find that it is defecting from the agreed-upon profile. (I don’t think the algorithm in that paper has much real-world applicability; I think the value comes from illustrating strategic considerations at play in the design of online learning agents, as well as helping Maxime and me skill up in MARL.)
Lastly, we’ve started building a repository of algorithms and learning environments to aid the kind of research we’d like to see more of. You can find it here.
Surrogate goals and safe Pareto improvements
The notion of surrogate goals was initially developed at CLR by Caspar Oesterheld and Tobias Baumann. The idea is that an agent can adopt goals other than their terminal goals, such that agents attempting to coerce them will threaten these “surrogate” goals rather than the goals they actually care about. (An example of a surrogate goal might be “make sure this harmless button isn’t pressed”.) Then, intuitively, if both the target and the threatener behave in the same way towards this surrogate goal as they do towards the target’s terminal goals, both agents involved in a coercive interaction can end up at least as well off as they would if the threat were directed at the terminal goal. (In particular, when agents are simultaneously committed to carrying out threats and to not giving in to them, harmless and cheap threats against the surrogate goal are carried out (e.g., the harmless button is pressed) instead of the original, very costly threat.)
Caspar has recently formalized surrogate goals in his safe Pareto improvement (SPI) framework. In this framework, agents have default strategies, which might be things like “never give in to threats” and “always carry out threats”. SPIs are designed to improve outcomes for all players in the event that their default strategies lead to conflict. An SPI is a transformation of a game which, when played according to the agents’ default strategies, leads with certainty to a Pareto improvement over the payoffs achieved under the default strategies in the original game. This framework relies on agents being mutually transparent. A nice feature of SPIs is that an agent can conclude that adopting an SPI improves on their default policy without having probabilistic beliefs about future bargaining encounters.
Robust motivations to cooperate
Much of our thinking on interventions thus far has had to do with constructing methods for bargaining (e.g., safe Pareto improvements, collaborative game specification) that might allow agents to avoid catastrophic outcomes. However, misaligned agents may lack the motivation to use these techniques in the first place. Conversely, agents who are highly motivated to find efficient agreements might be expected to figure out good bargaining strategies on their own. So we want agents to be motivated to avoid catastrophically uncooperative outcomes, and to maintain those motivations even if they are otherwise misaligned.
We plan to think more about ways in which motivations to avoid catastrophic cooperation failure can be maintained even if a system is misaligned. One direction is the construction of better environments and policies for the demonstration of cooperative behavior (e.g., using something like CGS to construct good bargaining policies in a reinforcement learning setting; drafting sections of an overseer's manual for amplification-based approaches). Another direction is adversarial training directed at finding instances of cooperation failure. This might involve understanding how to construct transparency tools for seeing how agents reason about other agents. (There may be unique technical challenges when it comes to understanding cognition about other minds.) Finally, we are also developing a better understanding of how preferences for deliberately harming other agents might arise.
Strategic origins of preferences to harm[5]
Some humans, at least under some conditions, seem to be indifferent to harming other agents, or even intrinsically value causing harm. One cluster of explanations for this has to do with the adaptive value of such preferences in certain evolutionary environments. The literatures on the evolutionary origins of aggression, spite, and retributivism suggest a few possible such mechanisms:
Similar strategic pressures may be present in some AI training regimes. Perhaps particularly worrying cases are those similar to human evolution, such as multi-agent autocurricula (also see Richard Ngo’s work on this in the context of AI safety). Also see the brief discussion of analogues for malevolence in TAI here. An important open question for this research is: How could AI systems end up with preferences to harm specifically in ways that lead to vast amounts of welfare loss?
AI governance and coordination
As highlighted in the discussion of incompatible commitments and beliefs, guarantees on the ability of agents to reach mutually beneficial agreements are not available in the absence of coordination on how agents model strategic situations and decide among Pareto-optimal agreements. Thus, it may be valuable for AI developers to coordinate on the standards their agents will use in bargaining situations. We are currently investigating different ways in which AI developers can come to agree on these standards; Stefan Torges has written some initial thoughts on this here.
Some reasons to work on this
Importance
Preventing catastrophic conflict and conflict-conducive preferences seems especially valuable by the lights of downside-focused value systems, as these seem to be among the most plausible ways by which astronomical amounts of suffering might come to exist. That said, even if you don't weight suffering reduction much more than happiness creation (and even if you don't care about reducing suffering at all), conflict reduction may still be extremely valuable, because conflict can pose a direct existential risk, constitute an existential risk factor, and waste enough resources that civilization falls significantly short of its potential.
Beyond the prevention of catastrophic conflict per se, understanding the safety of multi-agent systems is arguably quite important by upside-focused lights, as evidenced by several research agendas in this area: Open Problems in Cooperative AI; ARCHES and Some AI research areas for human existential safety; and Richard Ngo’s work on multi-agent AI safety.
Still, it seems safe to say that work on avoiding catastrophic conflict is more clearly important by downside-focused lights than on more mainstream longtermist views.
Tractability
There is a prima facie argument that work directed at preventing catastrophic conflict is more tractable than work on alignment and Cooperative AI more generally: it should be easier to at least prevent catastrophic conflict, rather than prevent catastrophic conflict and also get all of the other things that we want. But there are a few counterarguments:
Even insofar as research on catastrophic conflict is a subset of alignment and Cooperative AI work, the most tractable way to make progress on it may be to work on those more general problems rather than to approach it directly. And insofar as we are trying to avoid catastrophic conflict conditional on alignment having otherwise failed, this may actually be harder than reducing the probability of misalignment in the first place.
Still, the prima facie argument, in combination with the fact that angles focused on catastrophic conflict have been explored very little thus far (see Neglectedness), suggests to me that there is a lot of value of information from this approach.
Neglectedness
There is very little technical work outside of CLR focused on preventing misalignment that leads to large amounts of suffering. This includes alignment research targeted at ensuring that agents at least do not have conflict-conducive preferences. Given the distribution of priorities among longtermists and AI safety researchers, I expect this to stay the same for the foreseeable future.
Relative to alignment, there is also very little work on avoiding catastrophic cooperation failures between systems that are aligned to their principals. (Though this is increasingly getting attention, for instance with the research agendas cited above.)
Defer to more capable successors?
A natural response to work on avoiding catastrophic cooperation failure involving TAI is that it can be deferred to more capable successors. For instance, humans with access to intelligent and aligned AI assistants would be in a much better position than we are to design more advanced systems that are capable of cooperating. So we should focus on aligning these initial intelligent systems rather than working on cooperation now. Or, sufficiently intelligent systems will already be competent at cooperation because they are much smarter than we are.
I think it is true that the ability to cooperate is often instrumentally useful, and so superintelligent agents will often find ways of cooperating. I also agree that, e.g., humans with aligned superhuman AI assistants will be better placed than we are to improve AIs’ ability to cooperate. Nevertheless, I think there are some compelling reasons to work on this already:
How we expect this work to have an impact
Acknowledgements
Thanks to Emery Cooper, Anthony DiGiovanni, Daniel Kokotajlo, Alex Lyzhov, Chi Nguyen, Caspar Oesterheld, Maxime Riché, and Stefan Torges for their comments on drafts of this post.
References
Boyd, Robert, Herbert Gintis, Samuel Bowles, and Peter J. Richerson. 2003. “The Evolution of Altruistic Punishment.” Proceedings of the National Academy of Sciences of the United States of America 100 (6): 3531–35.
Fey, Mark, and Kristopher W. Ramsay. 2007. “Mutual Optimism and War.” American Journal of Political Science 51 (4): 738–54.
Forber, Patrick, and Rory Smead. 2016. “The Evolution of Spite, Recognition, and Morality.” Philosophy of Science 83 (5): 884–96.
Powell, Robert. 2006. “War as a Commitment Problem.” International Organization 60 (1): 169–203.
Sechser, Todd S. 2018. “Reputations and Signaling in Coercive Bargaining.” The Journal of Conflict Resolution 62 (2): 318–45.
Smead, Rory, and Patrick Forber. 2013. “The Evolutionary Dynamics of Spite in Finite Populations.” Evolution 67 (3): 698–707.
By "cosmically significant", we mean significant relative to expected future suffering. Note that it may turn out that the amount of suffering we can influence is dwarfed by suffering that we can't influence. By "expected suffering in the future" we mean "expectation of action-relevant suffering in the future". ↩︎
For instance, actors who expect to gain power in the future may not be able to credibly commit to adhering to a negotiated agreement, and therefore their counterpart may find it rational to launch a preemptive attack. The Peloponnesian War has been attributed to Sparta's need to preempt the growing power of Athens. ↩︎
Conflicts due to factors 1-3 can be modeled as occurring in the equilibrium of a game, showing that conflict can happen between agents who are "intelligent" at least in the sense of "game-theoretically rational". Explanations 1 and 2 are among the classic "rationalist explanations for war". ↩︎
Formally we define them as games in which there exist equilibria that maximize different welfare functions but lead to Pareto-suboptimal outcomes when strategies from each equilibrium are played together. ↩︎
This section draws heavily on work done by Eric Chen and Michael Aird when they were research fellows at CLR. ↩︎