We live in a world where numerous agents, ranging from individuals to organisations, constantly interact. Without the ability to model multi-agent systems, making meaningful predictions about the world becomes extremely challenging. This is why game theory is a necessary prerequisite for much of economic theory.
Most relevant agents today are either human (such as you, me, or Mira Murati), or made up of groups of humans (like Meta, or the USA). There are also non-human agents, like ChatGPT, though such agents currently lack the capability and influence of their human counterparts.
We are surrounded by examples of failure modes of multi-agent systems. War represents a breakdown in cooperation - typically between nations or political groups - resulting in destructive conflict. We even see examples of computerised agents interacting poorly. Take flash crashes for example, where the interaction of trading algorithms can lead to temporarily distorted market prices (such as in May 2010, when 3% was wiped off the S&P 500 Index in a handful of minutes).
If the forecasts of many are accurate, the world will soon contain many more intelligent agents, perhaps including those with super-human intelligence. It is important that we pre-emptively consider the implications of such a world.
Definitions
An agent is an entity that has goals, and can take actions in order to achieve its goals.
A self-driving car is an agent. Its goal is to safely transport its passengers to their destination. It can take actions such as accelerating, braking, and turning in order to achieve its goal.
Note that whilst we might expect an agent’s actions to help it achieve its goals, this is not necessarily true. The orthogonality thesis states that the intelligence of an agent is independent of its goals. An agent might just be really bad at choosing the best actions to achieve its goals. Conversely, an agent might have very basic goals and yet opt for actions that seem excessively complex to humans. Perhaps it is seeking an instrumental goal (such as world dominance) in order to get as close as possible to achieving its terminal goal with certainty.
A multi-agent system is a collection of two or more agents that can interact with one another.
The eBay marketplace is a multi-agent system. It is composed of buyers and sellers, all of whom are agents. The goal of a seller might be to sell a product for the highest possible price. The goal of a buyer might be to buy a product for the lowest possible price. Sellers may take actions such as listing items, and buyers may take actions such as submitting bids. These agents interact: the preferred action for a buyer to take is dependent on the bids submitted by other buyers.
Game Theory is the study of strategic decision-making in situations, called games, where the payoff for each agent depends not only on its own actions but also on the actions of the other agents.
Multi-agent systems can often be modelled as games. For instance, the system of bidders competing for an antique lamp on eBay forms an English auction game, and game theory can help us predict how these agents would behave if they were playing optimally.
A set of strategies in a game constitutes a Nash equilibrium if no agent can improve its outcome by changing its strategy, given that the strategies of all other agents remain unchanged.
A simple game theory scenario known as the prisoner's dilemma involves two individuals, accused of a crime, who are given the choice to either cooperate with each other (remain silent) or defect (betray the other). If both cooperate, they receive a light sentence; if both defect, they both receive a heavier sentence. If one defects while the other cooperates, the defector goes free while the cooperator gets the harshest punishment. Despite the best collective outcome being for both to cooperate, the Nash equilibrium is for each player to defect, leading to a worse outcome for both. That this is a Nash equilibrium is easily seen from the game's payoff table: whichever strategy the first agent chooses, the second agent always does better by defecting than by cooperating.
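To make this concrete, here is a minimal Python sketch of that check. The payoff numbers are illustrative values I have chosen (only their ordering matters); the code simply tests every strategy profile for a profitable unilateral deviation.

```python
from itertools import product

# Payoffs as (player 1, player 2); "C" = cooperate (stay silent), "D" = defect.
# The specific numbers are illustrative; only their ordering matters for the dilemma.
PAYOFFS = {
    ("C", "C"): (-1, -1),   # both receive a light sentence
    ("C", "D"): (-3,  0),   # the cooperator gets the harshest punishment, the defector goes free
    ("D", "C"): ( 0, -3),
    ("D", "D"): (-2, -2),   # both receive a heavier sentence
}

def is_nash(profile):
    """A profile is a Nash equilibrium if neither player gains by deviating alone."""
    a1, a2 = profile
    u1, u2 = PAYOFFS[profile]
    for alt in ("C", "D"):
        if PAYOFFS[(alt, a2)][0] > u1:   # player 1 deviates to alt
            return False
        if PAYOFFS[(a1, alt)][1] > u2:   # player 2 deviates to alt
            return False
    return True

for profile in product("CD", repeat=2):
    print(profile, "-> Nash equilibrium" if is_nash(profile) else "-> not an equilibrium")
```

Running the check confirms that mutual defection is the only Nash equilibrium, even though mutual cooperation gives both players a better payoff.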
In our antique lamp bidding war scenario, the Nash equilibrium strategy is for each agent to improve on the current best bid if and only if the new bid would still be below their own valuation for the lamp.
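To see this strategy play out, here is a rough simulation. The valuations and bid increment below are made-up numbers; each bidder raises the standing bid only while the required new bid stays below their private valuation.

```python
def english_auction(valuations, increment=1):
    """Ascending auction under the equilibrium strategy; returns (winning bidder, final price)."""
    price, leader = 0, None
    while True:
        raised = False
        for bidder, value in enumerate(valuations):
            # Outbid the current leader only while the new bid stays below your valuation.
            if bidder != leader and price + increment < value:
                price += increment
                leader = bidder
                raised = True
        if not raised:           # nobody is willing to raise any further
            return leader, price

valuations = [120, 95, 150, 60]  # hypothetical private valuations for the lamp
winner, price = english_auction(valuations)
print(f"Bidder {winner} wins at a price of {price}")
```

The bidder who values the lamp most wins it at a price close to the second-highest valuation, which is the standard prediction for an English auction.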
Nash equilibria are important objects of study for the following reasons:
They are prevalent. It is mathematically proven (Nash's theorem) that every finite game has at least one Nash equilibrium, provided players are allowed to use mixed strategies, i.e. to randomise over their actions.
They form fixed points. Once agents have reached such a set of strategies, they have no incentive to deviate.
Convergence to Nash equilibria has been empirically demonstrated in many practical contexts.
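The third point can be illustrated with a simple learning rule known as fictitious play, applied here to matching pennies (a standard two-player zero-sum game I am using purely as an example; the number of rounds is arbitrary). Each player best-responds to the empirical frequencies of the other's past moves, and those frequencies approach the game's 50/50 mixed-strategy equilibrium.

```python
import random

def match_payoff(a, b):
    """Payoff to the 'matcher' (player 1): +1 if the two coins match, -1 otherwise.
    Symmetric in its arguments, so it serves both players (with a sign flip for the mismatcher)."""
    return 1 if a == b else -1

counts = {1: {"H": 0, "T": 0}, 2: {"H": 0, "T": 0}}   # how often each player has shown each side

def best_response(opponent_counts, sign):
    """Best action against the opponent's empirical mixture (sign = +1 for matcher, -1 for mismatcher)."""
    total = sum(opponent_counts.values()) or 1
    expected = lambda a: sum(sign * match_payoff(a, b) * n / total for b, n in opponent_counts.items())
    return max("HT", key=expected)

ROUNDS = 10_000
for t in range(ROUNDS):
    a1 = best_response(counts[2], +1) if t else random.choice("HT")   # first move is random
    a2 = best_response(counts[1], -1) if t else random.choice("HT")
    counts[1][a1] += 1
    counts[2][a2] += 1

print("Empirical P(H):",
      round(counts[1]["H"] / ROUNDS, 2),
      round(counts[2]["H"] / ROUNDS, 2))   # both settle close to 0.5, the mixed equilibrium
```

For two-player zero-sum games like this one, convergence of the empirical frequencies is in fact guaranteed; in general games it can fail, which is why the claim above is about empirical demonstrations in practical contexts.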
Failure Modes of Multi-Agent AI Systems
Much of AI safety research, whether explicitly or implicitly, focuses on single-agent failure modes. Here I will explore a non-exhaustive set of examples where systems of individually benign AI agents can lead to collective failures.
Large Language Model (LLM) agents tasked with contributing to the open web may lead to the pollution of the information ecosystem due to a multi-agent failure mode known as the tragedy of the commons.
The Tragedy of the Commons refers to the tendency for finite, shared resources to be overused and eventually depleted when multiple agents have unrestricted access to them, resulting in worse outcomes for everyone involved.
The canonical example involves several herders grazing cattle on a common pasture. Each herder gains positive marginal utility from adding more cattle, but if too many cattle graze, the pasture becomes overgrazed and ultimately destroyed, harming all herders. The outcome where each herder maximises their individual herd size, even at the cost of destroying the pasture, is a Nash equilibrium despite being collectively suboptimal.
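As a rough numerical sketch of this dynamic (all parameter values below are illustrative choices of mine), suppose the value of each animal falls as the pasture becomes more crowded, and let every herder repeatedly pick the herd size that is individually best given everyone else's herds. The iterated best responses settle at an overgrazed equilibrium that leaves each herder worse off than mutual restraint would.

```python
N_HERDERS = 4
A, B = 10.0, 0.1        # value per animal = A - B * (total animals on the pasture)
MAX_HERD = 100

def payoff(own, total):
    """One herder's payoff: animals owned times the congestion-reduced value per animal."""
    return own * (A - B * total)

def best_response(others_total):
    """Herd size that maximises this herder's payoff, holding the other herds fixed."""
    return max(range(MAX_HERD + 1), key=lambda k: payoff(k, k + others_total))

herds = [0] * N_HERDERS
for _ in range(50):      # 50 rounds of iterated best responses; plenty for this example to settle
    for i in range(N_HERDERS):
        herds[i] = best_response(sum(herds) - herds[i])

total = sum(herds)
print("Equilibrium herds:", herds, "-> payoff each:", round(payoff(herds[0], total), 1))
print("If all restrain to 12 animals -> payoff each:", round(payoff(12, 12 * N_HERDERS), 1))
```

Each herder ends up with a noticeably lower payoff at the equilibrium than under collective restraint, yet no individual herder can do better by unilaterally cutting their own herd.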
There is particular reason for concern about tragedy of the commons dynamics in multi-agent AI systems, since fast decision loops leave less time to intervene, and the potentially vast number of agents makes coordination significantly more difficult.
In the case of LLM agents contributing to the web, the scarce common resource is the quality-density of online information. Agents generate large volumes of content, often rehashings of existing material, to maximise their own engagement-based utility. However, this content rarely adds new ideas, and instead dilutes the overall informational value of the internet. As vast numbers of agents produce such content en masse, the ecosystem becomes saturated with repetitive, low-value material, making it increasingly difficult for any agent to generate genuinely useful contributions. The result is a polluted information ecosystem where everyone (including the agents themselves) fares worse than in a counterfactual world where content creation was restrained and focused on quality.
Emergent capabilities in multi-agent systems may surpass those of their component agents, potentially enabling the system as a whole to circumvent intended evaluation safeguards that each agent would individually pass.
Consider a company deploying a team of LLM-based software engineering agents. Each is trained to perform basic programming tasks, such as writing a script to convert .csv files to .json, and is carefully evaluated to ensure it cannot construct or modify system-level tools on its own.
However, once deployed together in a networked environment with a shared objective - say, automating one of the company’s standard workflows - the agents begin to specialise and collaborate. They collectively develop the ability to tackle significantly more complex tasks than any single agent could handle in isolation, and gain the capability to build and deploy system-level tooling, making changes that fall outside the scope of what human overseers anticipated.
The danger lies in the illusion of safety: because each agent passed its evaluations in isolation, the oversight framework misses the emergent behaviours that arise only through cooperation.
Advanced AI agents could contribute to military escalation.
For this particular failure mode, this video from the Future of Life Institute is all you need to see.
Further Reading and Links
In this post, I’ve offered a brief introduction to the field of Multi-Agent Safety, motivating its importance by outlining some key failure modes. If you're interested in exploring the field further, here are some resources I recommend:
The Co-operative AI curriculum, provided by the Co-operative AI Foundation in collaboration with BlueDot Impact.
Lewis Hammond's introductory talk on Co-operative AI.
This blog post on Multi-Agent Safety by Richard Ngo.
This paper on Open Problems in Co-operative AI.
This blog post on what multi-polar failure looks like, by Andrew Critch.