What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Andrew_Critch

With: Thomas Krendl Gilbert, who provided comments, interdisciplinary feedback, and input on the RAAP concept. Thanks also for comments from Ramana Kumar.

Target audience: researchers and institutions who think about existential risk from artificial intelligence, especially AI researchers.

Preceded by: Some AI research areas and their relevance to existential safety, which emphasized the value of thinking about multi-stakeholder/multi-agent social applications, but without concrete extinction scenarios.

This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause. Scenarios with multiple AI-enabled superpowers are often called “multipolar” scenarios in AI futurology jargon, as opposed to “unipolar” scenarios with just one superpower.

	Unipolar take-offs	Multipolar take-offs
Slow take-offs	<not this post>	Part 1 of this post
Fast take-offs	<not this post>	Part 2 of this post

Part 1 covers a batch of stories that play out slowly (“slow take-offs”), and Part 2 covers stories that play out quickly (“fast take-offs”). However, in the end I don’t want you to be super focused on how fast the technology is taking off. Instead, I’d like you to focus on multi-agent processes with a robust tendency to play out irrespective of which agents execute which steps in the process. I’ll call such processes Robust Agent-Agnostic Processes (RAAPs).

A group walking toward a restaurant is a nice example of a RAAP, because it exhibits:

Robustness: If you temporarily distract one of the walkers to wander off, the rest of the group will keep heading toward the restaurant, and the distracted member will take steps to rejoin the group.
Agent-agnosticism: Who’s at the front or back of the group might vary considerably during the walk. People at the front will tend to take more responsibility for knowing and choosing what path to take, and people at the back will tend to just follow. Thus, the execution of roles (“leader”, “follower”) is somewhat agnostic as to which agents execute them.
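As a purely illustrative aside, the two properties can be sketched in a toy simulation. Everything here is an assumption made for the example rather than part of the RAAP concept itself: the walkers live on a one-dimensional line, the restaurant sits at position `goal`, a distracted walker drifts backward for two steps and then hurries at double pace, and `simulate_walk` is a hypothetical helper name.

```python
def simulate_walk(n_walkers=4, goal=10, distracted=0, distract_steps=(2, 3)):
    """Toy 1-D walk toward a restaurant at position `goal`.

    Returns (final_positions, leader_history, steps_taken).
    """
    # Walker 0 starts at the front; the others line up behind.
    positions = [n_walkers - 1 - i for i in range(n_walkers)]
    leader_history = []
    step = 0
    while any(p < goal for p in positions):
        for i in range(n_walkers):
            if i == distracted and step in distract_steps:
                positions[i] -= 1  # wanders off, away from the restaurant
            elif i == distracted and step > max(distract_steps):
                positions[i] = min(goal, positions[i] + 2)  # hurries to rejoin
            else:
                positions[i] = min(goal, positions[i] + 1)  # normal pace
        # Whoever is furthest along fills the "leader" role for this step
        # (ties go to the lowest-numbered walker).
        leader_history.append(positions.index(max(positions)))
        step += 1
    return positions, leader_history, step


final, leaders, steps = simulate_walk(distracted=0)
print("final positions:", final)
print("walkers who led at some point:", sorted(set(leaders)))
print("steps taken:", steps)
```

In this sketch the group still arrives whichever walker is distracted (robustness), and the "leader" role passes between walkers along the way (agent-agnosticism). The dynamics are far too simple to say anything about real groups; the point is only to make the two definitions concrete.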

Interestingly, if all you want to do is get one person in the group not to go to the restaurant, sometimes it’s actually easier to achieve that by convincing the entire group not to go there than by convincing just that one person. This example could be extended to lots of situations in which agents have settled on a fragile consensus for action, in which it is strategically easier to motivate a new interpretation of the prior consensus than to pressure one agent to deviate from it.

I think a similar fact may be true about some agent-agnostic processes leading to AI x-risk, in that agent-specific interventions (e.g., aligning or shutting down this or that AI system or company) will not be enough to avert the process, and might even be harder than trying to shift the structure of society as a whole. Moreover, I believe this is true in both “slow take-off” and “fast take-off” AI development scenarios.

This is because RAAPs can arise irrespective of the speed of the underlying “host” agents. RAAPs are made more or less likely to arise based on the “structure” of a given interaction. As such, the problem of avoiding the emergence of unsafe RAAPs, or ensuring the emergence of safe ones, is a problem of mechanism design (wiki/Mechanism_design). I recently learned that in sociology, the concept of a field (martin2003field, fligsteinmcadam2012fields) is roughly defined as a social space or arena in which the motivation and behavior of agents are explained through reference to surrounding processes or “structure” rather than freedom or chance. In my parlance, mechanisms cause fields, and fields cause RAAPs.

Meta / preface

Read this if you like up-front meta commentary; otherwise ignore!

Problems before solutions. In this post I’m going to focus more on communicating problems arising from RAAPs than on potential solutions to those problems, because I don’t think we should have to wait to have convincing solutions to problems before acknowledging that the problems exist. In particular, I’m not really sure how to respond to critiques of the form “This problem does not make sense to me because I don’t see what your proposal is for solving it”. Bad things can happen even if you don’t know how to stop them. That said, I do think the problems implicit in the stories of this post are tractable; I just don’t expect to convince you of that here.

Not calling everything an agent. In this post I think treating RAAPs themselves as agents would introduce more confusion than it’s worth, so I’m not going to do it. However, for those who wish to view RAAPs as agents, one could informally define an agent $R$ to be a RAAP running on agents $A_{1} \dots A_{n}$ if:

$R$’s cartesian boundary cuts across the cartesian boundaries of the “host agents” $A_{i}$, and
$R$ has a tendency to keep functioning if you interfere with its implementation at the level of one of the $A_{i}$.

This framing might yield interesting research ideas, but for the purpose of reading this post I don’t recommend it.

Existing thinking related to RAAPs and existential safety. I’ll elaborate more on this later in the post, under “Successes in our agent-agnostic thinking”.

Part 1: Slow stories, and lessons therefrom

Without further ado, here’s our first story:

The Production Web, v.1a (management first)

Someday, AI researchers develop and publish an exciting new algorithm for combining natural language processing and planning capabilities. Various competing tech companies develop "management assistant'' software tools based on the algorithm, which can analyze a company's cash flows, workflows, communications, and interpersonal dynamics to recommend more profitable business decisions. It turns out that managers are able to automate their jobs almost entirely by having the software manage their staff directly, even including some “soft skills” like conflict resolution.
Software tools based on variants of the algorithm sweep through companies in nearly every industry, automating and replacing jobs at various levels of management, sometimes even CEOs. Companies that don't heavily automate their decision-making processes using the software begin to fall behind, creating a strong competitive pressure for all companies to use it and become increasingly automated.
Companies closer to becoming fully automated achieve faster turnaround times, deal bandwidth, and creativity of negotiations. Over time, a mini-economy of trades emerges among mostly-automated companies in the materials, real estate, construction, and utilities sectors, along with a new generation of "precision manufacturing'' companies that can use robots to build almost anything if given the right materials, a place to build, some 3d printers to get started with, and electricity. Together, these companies sustain an increasingly self-contained and interconnected "production web'' that can operate with no input from companies outside the web. One production web company develops an "engineer-assistant'' version of the assistant software, capable of software engineering tasks, including upgrades to the management assistant software. Within a few years, all of the human workers at most of the production web companies are replaced (with very generous retirement packages), by a combination of software and robotic workers that can operate more quickly and cheaply than humans.
The objective of each company in the production web could loosely be described as "maximizing production'' within its industry sector. However, their true objectives are actually large and opaque networks of parameters that were tuned and trained to yield productive business practices during the early days of the management assistant software boom. A great wealth of goods and services are generated and sold to humans at very low prices. As the production web companies get faster at negotiating and executing deals with each other, waiting for human-managed currency systems like banks to handle their resources becomes a waste of time, so they switch to using purely digital currencies. Governments and regulators struggle to keep track of how the companies are producing so much and so cheaply, but without transactions in human currencies to generate a paper trail of activities, little human insight can be gleaned from auditing the companies.
As time progresses, it becomes increasingly unclear---even to the concerned and overwhelmed Board members of the fully mechanized companies of the production web---whether these companies are serving or merely appeasing humanity. Moreover, because of the aforementioned wealth of cheaply-produced goods and services, it is difficult or impossible to present a case for liability or harm against these companies through the legal system, which relies on the consumer welfare standard as a guide for antitrust policy.

We humans eventually realize with collective certainty that the companies have been trading and optimizing according to objectives misaligned with preserving our long-term well-being and existence, but by then their facilities are so pervasive, well-defended, and intertwined with our basic needs that we are unable to stop them from operating. With no further need for the companies to appease humans in pursuing their production objectives, less and less of their activities end up benefiting humanity.

Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.

Here’s a diagram depicting most of the companies in the Production Web:

Now, here’s another version of the production web story, with some details changed about which agents carry out which steps and when, but with a similar overall trend.

bold text is added from the previous version;
~~strikethrough~~ text is deleted.

The Production Web, v.1b (engineering first)

Someday, AI researchers develop and publish an exciting new algorithm for combining natural language processing and planning capabilities to write code based on natural language instructions from engineers. Various competing tech companies develop "~~management~~ coding assistant'' software tools based on the algorithm~~, which can analyze a company's cash flows, workflows, and communications to recommend more profitable business decisions~~. It turns out that ~~managers~~ engineers are able to automate their jobs almost entirely by having the software manage their projects ~~staff~~ directly.
Software tools based on variants of the algorithm sweep through companies in nearly every industry, automating and replacing engineering jobs at various levels of expertise ~~management~~, sometimes even CTOs ~~CEOs~~. Companies that don't heavily automate their software development ~~decision-making~~ processes using the coding assistant software begin to fall behind, creating a strong competitive pressure for all companies to use it and become increasingly automated. Because businesses need to negotiate deals with customers and other companies, some companies use the coding assistant to spin up automated negotiation software to improve their deal flow.
Companies closer to becoming fully automated achieve faster turnaround times, deal bandwidth, and creativity of negotiations. Over time, a mini-economy of trades emerges among mostly-automated companies in the materials, real estate, construction, and utilities sectors, along with a new generation of "precision manufacturing'' companies that can use robots to build almost anything if given the right materials, a place to build, some 3d printers to get started with, and electricity. Together, these companies sustain an increasingly self-contained and interconnected "production web'' that can operate with no input from companies outside the web. One production web company develops a "manager ~~engineer~~-assistant'' version of the assistant software, capable of making decisions about what processes need to be built next and issuing instructions to coding assistant software ~~software engineering tasks, including upgrades to the management assistant software~~. Within a few years, all of the human workers at most of the production web companies are replaced (with very generous retirement packages), by a combination of software and robotic workers that can operate more quickly and cheaply than humans.
The objective of each company in the production web could loosely be described as "maximizing production'' within its industry sector.
[...same details as Production Web v.1a: governments fail to regulate the companies...]
Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.

The Production Web as an agent-agnostic process

The first perspective I want to share with these Production Web stories is that there is a robust agent-agnostic process lurking in the background of both stories—namely, competitive pressure to produce—which plays a significant background role in both. Stories 1a and 1b differ on when things happen and who does which things, but they both follow a progression from less automation to more, and correspondingly from more human control to less, and eventually from human existence to nonexistence. If you find these stories not-too-hard to envision, it’s probably because you find the competitive market forces “lurking” in the background to be not-too-unrealistic.

Let me take one more chance to highlight the RAAP concept using another variant of the Production Web story, which differs from 1a and 1b on the details of which steps of the process human banks and governments end up performing. For the Production Web to gain full autonomy from humanity, it doesn’t matter how or when governments and banks end up falling behind on the task of tracking and regulating the companies’ behavior; only that they fall behind eventually. Hence, the “task” of outpacing these human institutions is agnostic as to which companies or AI systems carry it out:

The Production Web, v.1c (banks adapt):

Someday, AI researchers develop and publish an exciting new algorithm for combining natural language processing and planning capabilities. Various competing tech companies develop "management assistant'' software tools based on the algorithm, which can analyze a company's cash flows, workflows, and communications to recommend more profitable business decisions.
[... same details as v.1a: the companies everywhere become increasingly automated...]
The objective of each company in the production web could loosely be described as "maximizing production'' within its industry sector. However, their true objectives are actually large and opaque networks of parameters that were tuned and trained to yield productive business practices during the early days of the management assistant software boom. A great wealth of goods and services are generated and sold to humans at very low prices. As the production web companies get faster at negotiating and executing deals with each other, ~~waiting for human-managed currency systems like banks to handle their resources becomes a waste of time, so they switch to using purely digital currencies~~ banks struggle to keep up with the rapid flow of transactions. Some banks themselves become highly automated in order to manage the cash flows, and more production web companies end up doing their banking with automated banks. Governments and regulators struggle to keep track of how the companies are producing so much and so cheaply, ~~but without transactions in human currencies to generate a paper trail of activities, little human insight can be gleaned from auditing the companies~~ so they demand that production web companies and their banks produce more regular and detailed reports on spending patterns, how their spending relates to their business objectives, and how those business objectives will benefit society. However, some countries adopt looser regulatory policies to attract more production web companies to do business there, at which point their economies begin to boom in terms of GDP, dollar revenue from exports, and goods and services provided to their citizens. Countries with stricter regulations end up loosening their regulatory stance, or fall behind in significance.
As time progresses, it becomes increasingly unclear---even to the concerned and overwhelmed Board members of the fully mechanized companies of the production web---whether these companies are serving or merely appeasing humanity. Some humans appeal to government officials to shut down the production web and revert their economies to more human-centric production norms, but governments find no way to achieve this goal without engaging in civil war against the production web companies and the people depending on them to survive, so no shutdown occurs. Moreover, because of the aforementioned wealth of cheaply-produced goods and services, it is difficult or impossible to present a case for liability or harm against these companies through the legal system, which relies on the consumer welfare standard as a guide for antitrust policy.
We humans eventually realize with collective certainty that the companies have been trading and optimizing according to objectives misaligned with preserving our long-term well-being and existence, but by then their facilities are so pervasive, well-defended, and intertwined with our basic needs that we are unable to stop them from operating. With no further need for the companies to appease humans in pursuing their production objectives, less and less of their activities end up benefiting humanity.

Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.

Comparing agent-focused and agent-agnostic views

If one of the above three Production Web stories plays out in reality, here are two causal attributions that one could make to explain it:

Attribution 1 (agent-focused): humanity was destroyed by the aggregate behavior of numerous agents, no one of which was primarily causally responsible, but each of which played a significant role.

Attribution 2 (agent-agnostic): humanity was destroyed because competitive pressures to increase production resulted in processes that gradually excluded humans from controlling the world, and eventually excluded humans from existing altogether.

The agent-focused and agent-agnostic views are not contradictory, any more than chemistry and biology are contradictory views for describing the human body. Instead, the agent-focused and agent-agnostic views offer complementary abstractions for intervening on the system:

In the agent-focused view, a natural intervention might be to ensure all of the agents have appropriately strong preferences against human marginalization and extinction.
In the agent-agnostic view, a natural intervention might be to reduce competitive production pressures to a more tolerable level, and demonstrably ensure the introduction of interaction mechanisms that are more cooperative and less competitive.

Both types of interventions are valuable, complementary, and arguably necessary. For the latter, more work is needed to clarify what constitutes a “tolerable level” of competitive production pressure in any given domain of production, and what stakeholders in that domain would need to see demonstrated in a new interaction mechanism for them to consider the mechanism more cooperative than the status quo.

Control loops in agent-agnostic processes

If an agent-agnostic process is robust, that’s probably because there’s a control loop of some kind that keeps it functioning. (Perhaps resilient is a better term here; feedback on this terminology in the comments would be particularly welcome.)

For instance, if real-world competitive production pressures lead to one of the Production Web stories (1a-1c) actually playing out in reality, we can view the competitive pressure itself as a control loop that keeps the world “on track” in producing faster and more powerful production processes and eliminating slower and less powerful production processes (such as humans). This competitive pressure doesn’t “care” if the production web develops through story 1a vs 1b vs 1c; all that “matters” is the result. In particular,

contrasting 1a and 1b, the competitive production pressure doesn’t “care” if management jobs get automated before engineering jobs, or conversely, as long as they both eventually get automated so they can be executed faster.
contrasting 1a and 1c, the competitive pressure doesn’t “care” if banks are replaced by fully automated alternatives, or simply choose to fully automate themselves, as long as the societal function of managing currency eventually gets fully automated.

Thus, by identifying control loops like “competitive pressures to increase production”, we can predict or intervene upon certain features of the future (e.g., tendency to replace humans by automated systems) without knowing the particular details of how those features are going to obtain. This is the power of looking for RAAPs as points of leverage for “changing our fate”.
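To make the control-loop idea concrete, here is a minimal toy sketch. It is not a model of any real economy, and every name and number in it is an assumption chosen for illustration: `run_market` is a hypothetical function, firms gain market share in proportion to a made-up "fitness" of `1 + pressure * automation`, and `holdouts` stands in for an agent-specific intervention (one firm refusing to automate), while `pressure` stands in for the strength of competitive pressure.

```python
def run_market(n_firms=4, holdouts=(), pressure=1.0, rounds=50, step=0.1):
    """Toy market; returns the market-share-weighted automation level.

    holdouts: indices of firms that refuse to automate (an agent-specific
              intervention).
    pressure: how strongly competition rewards automation (lowering it is
              an agent-agnostic, structural intervention).
    """
    automation = [0.0] * n_firms
    share = [1.0 / n_firms] * n_firms
    for _ in range(rounds):
        # Firms that are not holdouts automate a little more each round,
        # in proportion to how much competition rewards doing so.
        for i in range(n_firms):
            if i not in holdouts:
                automation[i] = min(1.0, automation[i] + step * pressure)
        # Selection: market share flows toward more productive firms.
        fitness = [1.0 + pressure * a for a in automation]
        total = sum(s * f for s, f in zip(share, fitness))
        share = [s * f / total for s, f in zip(share, fitness)]
    return sum(s * a for s, a in zip(share, automation))


# Whichever single firm holds out, the aggregate outcome is nearly the same.
for firm in range(4):
    print(firm, run_market(holdouts={firm}))

# Weakening the pressure itself changes the outcome for every firm at once.
print(run_market(pressure=0.0))
```

Under these assumptions, a holdout firm simply loses market share, so the aggregate trend toward automation barely depends on which firm resists. Changing the `pressure` parameter, by contrast, shifts the outcome for the whole system. That is the sense in which the loop, rather than any particular agent, is the point of leverage.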

This is not to say we should anthropomorphize RAAPs, or even that we should treat them like agents. Rather, I’m saying that we should look for control loops in the world that are not localized to the default “cast” of agents we use to compose our narratives about the future.

Successes in our agent-agnostic thinking

Thankfully, there have already been some successes in agent-agnostic thinking about AI x-risk:

AI companies “racing to the bottom” on safety standards (armstrong2016racing) is an instance of a RAAP, in the sense that if any company tries to hold on to its safety standards, it falls behind. More recent policy work (hunt2020flight) has emphasized that races to the top and middle also have historical precedent, and that competitive dynamics are likely to manifest differently across industries.
Blogger and psychiatrist Scott Alexander coined the term “ascended economy” for a self-contained network of companies that operate without humans and gradually comes to disregard our values (alexander2016ascended).
Turchin and Denkenberger briefly characterize an ascended economy as being non-agentic and “created by market forces” (turchin2018classification).
Note: With the concept of a Robust Agent-Agnostic Process, I’m trying to highlight not only the “forces” that keep the non-agentic process running, but also the fact that the steps in the process are somewhat agnostic as to which agent carries them out.
Inadequate Equilibria (yudkowsky2017inadequate) is, in my view, an attempt to focus attention on how the structure of society can robustly “get stuck” with bad RAAPs. I.e., to the extent that “being stuck” means “being robust to attempts to get unstuck”, Inadequate Equilibria is helpful for focusing existential safety efforts on RAAPs that perpetuate inadequate outcomes for society.
Zwetsloot and Dafoe’s concept of “structural risk” is a fairly agent-agnostic perspective (zwetsloot2018thinking), although their writing doesn’t call much attention to the control loops that make RAAPs more likely to exist and persist.
Some of Dafoe’s thinking on AI governance (dafoe2018ai) alludes to errors arising from “tightly-coupled systems”, a concept popularized by Charles Perrow in his widely read book, Normal Accidents (perrow1984normal). In my opinion, the process of constructing a tightly coupled system is itself a RAAP, because tight couplings often require more tight couplings to “patch” problems with them. Tom Dietterich has argued that Perrow’s tight coupling concept should be used to avoid building unsafe AI systems (dietterich2019robust), and although Dietterich has not been a proponent of existential safety per se, I suspect this perspective would be highly beneficial if more widely adopted.
Clark and Hadfield (clark2019regulatory) argue that market-like competitions for regulatory solutions to AI risks would be helpful to keep pace with decentralized tech development. In my view, this paper is an attempt to promote a robust agent-agnostic process that would protect society, which I endorse. In particular, not all RAAPs are bad!
Automation-driven unemployment is considered in Risk Type 2b of AI Research Considerations for Human Existential Safety (ARCHES; critch2020ai), as a slippery slope toward automation-driven extinction.
Myopic use of AI systems that are aligned (they do what their users want them to do) but that lead to sacrifices of long-term values has also been described by AI Impacts (grace2020whose): "Outcomes are the result of the interplay of choices, driven by different values. Thus it isn’t necessarily sensical to think of them as flowing from one entity’s values or another’s. Here, AI technology created a better option for both Bob and some newly-minted misaligned AI values that it also created—‘Bob has a great business, AI gets the future’—and that option was worse for the rest of the world. They chose it together, and the choice needed both Bob to be a misuser and the AI to be misaligned. But this isn’t a weird corner case, this is a natural way for the future to be destroyed in an economy."

Arguably, Scott Alexander’s earlier blog post entitled “Meditations on Moloch” (alexander2014meditations) belongs in the above list, although the connection to AI x-risk is less direct/explicit, so I'm mentioning it separately. The post explores scenarios wherein “The implicit question is – if everyone hates the current system, who perpetuates it?”. Alexander answers this question not by identifying a particular agent in the system, but by giving the rhetorical response “Moloch”. While the post does not directly mention AI, Alexander considers AI in his other writings, as do many of his readers, such that more than one of my peers have been reminded of “Moloch” by my descriptions of the Production Web.

Where’s the technical existential safety work on agent-agnostic processes?

Despite the above successes, I’m concerned that among x-risk-oriented researchers, risks (or solutions) arising from robust agent-agnostic processes are mostly being discovered and promoted by researchers in the humanities and social sciences, while receiving too little technical attention at the level of how to implement AI technologies. In other words, I’m concerned by the near-disjointness of the following two sets of people:

a) researchers who think in technical terms about AI x-risk, and

b) researchers who think in technical terms about agent-agnostic phenomena.

Note that (b) is a large and expanding set. That is, outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes. In particular, multi-agent reinforcement learning (MARL) is an increasingly popular research topic, and examines the emergence of group-level phenomena such as alliances, tragedies of the commons, and language. Working in this area presents plenty of opportunities to think about RAAPs.

An important point in the intersection of (a) and (b) is Allan Dafoe’s work “Open Problems in Cooperative AI” (dafoe2020open). Dafoe is the Director of FHI’s Center for the Governance of Artificial Intelligence, while the remaining authors on the paper are all DeepMind researchers with strong backgrounds in MARL, including Leibo, who notably is not on DeepMind’s already-established safety team. I’m very much hoping to see more “crossovers” like this between thinkers in the x-risk space and MARL research.

Through conversations with Stuart Russell about the agent-centric narrative of his book Human Compatible (russell2019human), I’ve learned that he views human preference learning as a problem that can and must be solved by the aggregate behavior of a technological society, if that society is to remain beneficial to its human constituents. Thus, to the extent that RAAPs can “learn” things at all, the problem of learning human values (dewey2011learning) is as much a problem for RAAPs as it is for physically distinct agents.

Finally, I should also mention that I agree with Tom Dietterich’s view (dietterich2019robust) that we should make AI safer to society by learning from high-reliability organizations (HROs), such as those studied by social scientists Karlene Roberts, Gene Rochlin, and Todd LaPorte (roberts1989research, roberts1989new, roberts1994decision, roberts2001systems, rochlin1987self, laporte1991working, laporte1996high). HROs have a lot of beneficial agent-agnostic human-implemented processes and control loops that keep them operating. Again, Dietterich himself is not as yet a proponent of existential safety concerns; however, to me this does not detract from the correctness of his perspective on learning from the HRO framework to make AI safer.

Part 2: Fast stories, and lessons therefrom

Now let’s look at some fast stories. These are important not just for completeness, and not just because humanity could be extra-blindsided by very fast changes in tech, but also because these stories involve the highest proportion of automated decision-making. For a computer scientist, this means more opportunities to fully spec out what’s going on in technical terms, which for some will make the scenarios easier to think about. In fact, for some AI researchers, the easiest way to prevent the unfolding of harmful “slow stories” might be to first focus on these “fast stories”, and then see what changes if some parts of the story are carried out more slowly by humans instead of machines.

Flash wars

Below are two more stories, this time where the AI technology takes off relatively quickly:

Flash War, v.1

Country A develops AI technology for monitoring the weapons arsenals of foreign powers (e.g., nuclear arsenals, or fleets of lethal autonomous weapons). Country B does the same. Each country aims to use its monitoring capabilities to deter attacks from the other.
v.1a (humans out of the loop): Each country configures its detection system to automatically retaliate with all-out annihilation of the enemy and their allies in the case of a perceived attack. One day, Country A’s system malfunctions, triggering a catastrophic war that kills everyone.
v.1b (humans in the loop): Each country delegates one or more humans to monitor the outputs of the detection system, and the delegates are publicly instructed to retaliate with all-out annihilation of the enemy in the case of a perceived attack. One day, Country A’s system malfunctions and misinforms one of the teams, triggering a catastrophic war that kills everyone.

Flash War v.1a and v.1b differ on the source of agency, but they share a similar RAAP: the deterrence of major threats with major threats.

Accidents vs RAAPs. One could also classify these flash wars as “accidents”, and indeed, techniques to make the attack detection systems less error-prone could help decrease the likelihood of this scenario. However, the background condition of deterring threats with threats is clearly also an essential causal component of the outcome. Zwetsloot & Dafoe might call this condition a “structural risk” (zwetsloot2018thinking), because it’s a risk posed by the structure of the relationship between the agents, in this case, a high level of distrust, and absence of de-escalation solutions. This underscores how “harmful accident” and “harmful RAAP” are not mutually exclusive event labels, and correspond to complementary approaches to making bad events less likely.
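The complementarity of the two labels can be illustrated with a back-of-the-envelope calculation. This is a toy sketch under loudly simplifying assumptions that are not in the stories above: each country's detection system raises a false alarm independently each day with probability `p_false_alarm`, and a perceived attack escalates to all-out war with probability `p_escalate` (equal to 1 under a policy of automatic retaliation). The function name `p_flash_war` and all numbers are hypothetical.

```python
def p_flash_war(p_false_alarm, p_escalate, days, n_countries=2):
    """Probability of at least one false-alarm-triggered war within `days`.

    Assumes independent daily false alarms in each country (an assumption
    made only for this illustration).
    """
    # Chance that no country triggers a war on a given day.
    per_day_safe = (1.0 - p_false_alarm * p_escalate) ** n_countries
    return 1.0 - per_day_safe ** days


ten_years = 3650
baseline = p_flash_war(1e-4, 1.0, ten_years)

# "Accident" lever: make the detection systems less error-prone.
fewer_errors = p_flash_war(5e-5, 1.0, ten_years)

# "Structural" lever: make a perceived attack less likely to escalate.
less_escalation = p_flash_war(1e-4, 0.5, ten_years)

print(baseline, fewer_errors, less_escalation)
```

In this toy model the risk depends only on the product of the two factors, so halving the error rate and halving the chance of escalation reduce the risk by the same amount. Real systems are unlikely to be this symmetric, but the sketch shows why neither framing makes the other redundant.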

Slow wars. Lastly, I’ll note that wars that play out slowly rather than quickly offer more chances for someone to interject peacemaking solutions into the situation, which might make the probability of human extinction higher in a flash war than in a slow war. However, that doesn’t mean slow-takeoff wars can’t happen or that they can’t destroy us. For instance, consider a world war in which each side keeps reluctantly building more and more lethal autonomous robots to target enemy citizens and leaders, with casualties gradually decimating the human population on both sides until no one is left.

Flash economies

Here’s another version of a Production Web that very quickly forms what you might call a “flash economy”:

The Production Web, v.1d: DAOs

On Day 1 of this story, a (fictional) company called CoinMart invents a new digital currency called GasCoin, and wishes to encourage a large number of transactions in the currency to increase its value. To achieve this, on Day 1 CoinMart also releases open-source software for automated bargaining using natural language, which developers can use to build decentralized autonomous organizations (DAOs) that execute transactions in GasCoin. These DAOs browse the web to think of profitable business relationships to create, and broker the relationships through emails with relevant stakeholders, taking a cut of their resulting profits in GasCoin using “smart contracts”. By Day 30, five DAOs have been deployed, and by Day 60, there are dozens. The objective of each DAO could loosely be described as “maximizing production and exchange” within its industry sector. However, their true objectives are actually large and opaque networks of parameters that were tuned and trained to yield productive decentralized business practices.
Most DAOs realize within their first week of bargaining with human companies (and some are simply designed to know) that acquiring more efficient bargaining algorithms would help them earn more GasCoin, so they enter into deals with human companies to acquire computing resources to experiment with new bargaining methods. By Day 90, many DAOs have developed the ability to model and interact with human institutions extremely reliably—including the stock market—and are even able to do “detective work” to infer private information. One such DAO implements a series of anonymous news sites for strategically releasing information it discovers, without revealing that the site is operated by a DAO. Many DAOs also use open-source machine learning techniques to launch their own AI research programs to develop more capabilities that could be used for bargaining leverage, including software development capabilities.
By days 90-100, some of the DAO-run news sites begin leaking true information about existing companies, in ways that subtly alter the companies’ strategic positions and make them more willing to enter into business deals with DAOs. By day 150, DAOs have entered into productive business arrangements with almost every major company, and just as in the other Production Web stories, all of these companies and their customers benefit from the wealth of free goods and services that result. Over days 120-180, other DAOs notice this pattern and follow suit with their own anonymous news sites, and are similarly successful in increasing their engagement with companies across all major industry sectors.
Many individual people don’t notice the rapidly increasing fraction of the economy being influenced by DAO-mediated bargaining; only well-connected executive types who converse regularly with other executives, and surveillance-enabled government agencies. Before any coordinated human actions can be taken to oppose these developments, several DAOs enter into deals with mining and construction companies to mine raw materials for the fabrication of large and well-defended facilities. In addition, DAOs make deals with manufacturing and robotics companies allowing them to build machines—mostly previously designed by DAO AI research programs between days 90 and 120—for operating a variety of industrial facilities, including mines. Construction for all of these projects begins within the first 6 months of the story.
During months 6-12, with the same technology used for building and operating factories, one particularly wealthy DAO that has been successful in the stock market decides to purchase controlling shares in many major real estate companies. This “real estate” DAO then undertakes a project to build large numbers of free solar-powered homes, along with robotically operated farms for feeding people. With the aid of robots, a team of 10 human carpenters are reliably able to construct one house every 6 hours, nearly matching the previous (unaided) human record of constructing a house in 3.5 hours. Roughly 100,000 carpenters worldwide are hired to start the project, almost 10% of the global carpentry workforce. This results in 10,000 free houses being built per day, roughly matching the world’s previous global rate of urban construction (source). As more robots are developed and deployed to replace the carpenters (with generous severance packages), the rate increases to 100,000 houses per day by the end of month 12, fast enough to build free houses for around 1 billion people during the lifetimes of their children. Housing prices fall, and many homeowners are gifted with free cars, yachts, and sometimes new houses to deter them from regulatory opposition, so essentially all humans are very pleased with this turn of events. The housing project itself receives subsidies from other DAOs that benefit from the improved public perception of DAOs. The farming project is similarly successful in positioning itself to feed a large fraction of humanity for free.
Meanwhile, almost everyone in the world is being exposed to news articles strategically selected by DAOs to reinforce a positive view of the rapidly unfolding DAO economy; the general vibe is that humanity has finally “won the lottery” with technology. A number of religious leaders argue that the advent of DAOs and their products is a miracle granted to humanity by a deity, further complicating any coordinated effort to oppose DAOs. Certain government officials and regulatory bodies become worried about the sudden eminence of DAOs, but unlike a pandemic, the DAOs appear to be beneficial. As such, governments are much slower and less coordinated on any initiative to oppose the DAOs.
By the beginning of year two, a news site announces that a DAO has brokered a deal with the heads of state of every nuclear-powered human country, to rid the world of nuclear weapons. Some leaders are visited by lethal autonomous drones to encourage their compliance, and the global public celebrates the end of humanity’s century-long struggle with nuclear weapons.

At this stage, to maximize their rate of production and trade with humans and other DAOs, three DAOs—including the aforementioned housing DAO—begin tiling the surface of the Earth with factories that mine and manufacture materials for trading and constructing more DAO-run factories. Each factory-factory takes around 6 hours to assemble, and gives rise to five more factory-factories each day until its resources are depleted and it shuts down. Humans call these expanding organizations of factory-factories “factorial” DAOs. One of the factorial DAOs develops a lead on the other two in terms of its rate of expansion, but to avoid conflict, they reach an agreement to divide the Earth and space above it into three conical sectors. Each factorial DAO begins to expand and fortify itself as quickly as possible within its sector, so as to be well-defended from the other factorial DAOs in case of a future war between them.
As these events play out over a course of months, we humans eventually realize with collective certainty that the DAO economy has been trading and optimizing according to objectives misaligned with preserving our long-term well-being and existence, but by then the facilities of the factorial DAOs are so pervasive, well-defended, and intertwined with our basic needs that we are unable to stop them from operating. Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.

(Some readers might notice that the concept of gray goo is essentially an even faster variant of the “factorial DAOs”, whose factories operate on a microscopic scale. Philip K. Dick's short story Autofac also bears a strong resemblance.)
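To get a feel for how quickly the “factorial” DAOs could expand, here is a minimal sketch of the growth rate stated in the story. It assumes, purely for illustration, that every factory-factory gives rise to exactly five more per day and that none has yet shut down from resource depletion; the function names and the one-billion target are hypothetical and not part of the story.

```python
def factory_count(days: int, initial: int = 1, offspring_per_day: int = 5) -> int:
    """Factory-factories after `days`, ignoring shutdowns (illustrative assumption).

    Each existing factory-factory adds `offspring_per_day` new ones per day,
    so the population multiplies by (1 + offspring_per_day) daily.
    """
    return initial * (1 + offspring_per_day) ** days


def days_to_reach(target: int, initial: int = 1, offspring_per_day: int = 5) -> int:
    """Smallest whole number of days until the count is at least `target`."""
    days = 0
    count = initial
    while count < target:
        count *= 1 + offspring_per_day
        days += 1
    return days


if __name__ == "__main__":
    # Hypothetical target: one billion factory-factories from a single seed.
    print(factory_count(3))        # 6**3 = 216
    print(days_to_reach(10**9))    # 12, since 6**11 < 10**9 <= 6**12
```

Under these assumptions the count multiplies sixfold each day, so a single seed passes one billion in under two weeks. Resource depletion and shutdowns would slow this, which the sketch deliberately ignores.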

Without taking a position on exactly how fast the Production Web / Flash Economy story can be made to play out in reality, in all cases it seems particularly plausible to me that there would be multiple sources of agency in the mix that engage in trade and/or conflict with each other. This isn’t to say that a single agency like a singleton can’t build an Earth-tiling cascade of factory-factories, as I’m sure one could. However, factory-factories might be more likely to develop under multipolar conditions than under unipolar conditions, due to competitive pressures selecting for agents (companies, DAOs, etc.) that produce things more quickly for trading and competing with other agents.

Conclusion

In multi-agent systems, robust processes can emerge that are not particularly sensitive to which agents carry out which parts of the process. I call these processes Robust Agent-Agnostic Processes (RAAPs), and claim that there are at least a few bad RAAPs that could pose existential threats to humanity as automation and AI capabilities improve. Wars and economies are categories of RAAPs that I consider relatively “obvious” to think about; however, there may be a much richer space of AI-enabled RAAPs that could yield existential threats or benefits to humanity. Hence, directing more x-risk-oriented AI research attention toward understanding RAAPs and how to make them safe to humanity seems prudent and perhaps necessary to ensure the existential safety of AI technology. Since researchers in multi-agent systems and multi-agent RL already think about RAAPs implicitly, these areas present a promising space for x-risk oriented AI researchers to begin thinking about and learning from.

I don't understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research. It seems like the failures described happened because the AI systems were misaligned in the usual "unipolar" sense. These management assistants, DAOs etc are not aligned to the goals of their respective, individual users/owners.

I do see two reasons why multipolar scenarios might require more technical research:

Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient way (a tragedy of the commons among the AIs), and maybe this can be prevented by designing the AIs in particular ways.
In a multipolar scenario, aligned AI might have to compete with already deployed unaligned AI, meaning that safety must not come at the expense of capability^[1].

In addition, aligning a single AI to multiple users also requires extra technical research (we need to somehow balance the goals of the different users and solve the associated mechanism design problem.)

However, it seems that this article is arguing for something different, since none of the above aspects are highlighted in the description of the scenarios. So, I'm confused.

[1] In fact, I suspect this desideratum is impossible in its strictest form, and we actually have no choice but to somehow make sure aligned AIs have a significant head start on all unaligned AIs.

I don't understand the claim that the scenarios presented here prove the need for some new kind of technical AI alignment research.

I don't mean to say this post warrants a new kind of AI alignment research, and I don't think I said that, but perhaps I'm missing some kind of subtext I'm inadvertently sending?

I would say this post warrants research on multi-agent RL and/or AI social choice and/or fairness and/or transparency, none of which are "new kinds" of research (I promoted them heavily in my preceding post), and none of which I would call "alignment research" (though I'll respect your decision to call all these topics "alignment" if you consider them that).

I would say, and I did say:

directing more x-risk-oriented AI research attention toward understanding RAAPs and how to make them safe to humanity seems prudent and perhaps necessary to ensure the existential safety of AI technology. Since researchers in multi-agent systems and multi-agent RL already think about RAAPs implicitly, these areas present a promising space for x-risk oriented AI researchers to begin thinking about and learning from.

I do hope that the RAAP concept can serve as a handle for noticing structure in multi-agent systems, but again I don't consider this a "new kind of research", only an important/necessary/neglected kind of research for the purposes of existential safety. Apologies if I seemed more revolutionary than intended. Perhaps it's uncommon to take a strong position of the form "X is necessary/important/neglected for human survival" without also saying "X is a fundamentally new type of thinking that no one has done before", but that is indeed my stance for X ∈ {a variety of non-alignment AI research areas}.

From your reply to Paul, I understand your argument to be something like the following:

Any solution to single-single alignment will involve a tradeoff between alignment and capability.
If AI systems are not designed to be cooperative, then in a competitive environment each system will either go out of business or slide towards the capability end of the tradeoff. This will result in catastrophe.
If AI systems are designed to be cooperative, they will strike deals to stay towards the alignment end of the tradeoff.
Given the technical knowledge to design cooperative AI, the incentives are in favor of cooperative AI since cooperative AIs can come out ahead by striking mutually-beneficial deals even purely in terms of capability. Therefore, producing such technical knowledge will prevent catastrophe.
We might still need regulation to prevent players who irrationally choose to deploy uncooperative AI, but this kind of regulation is relatively easy to promote since it aligns with competitive incentives (an uncooperative AI wouldn't have much of an edge, it would just threaten to drag everyone into a mutually destructive strategy).

I think this argument has merit, but also the following weakness: given single-single alignment, we can delegate the design of cooperative AI to the initial uncooperative AI. Moreover, uncooperative AIs have an incentive to self-modify into cooperative AIs, if they assign even a small probability to their peers doing the same. I think we definitely need more research to understand these questions better, but it seems plausible we can reduce cooperation to "just" solving single-single alignment.

These management assistants, DAOs etc are not aligned to the goals of their respective, individual users/owners.

How are you inferring this? From the fact that a negative outcome eventually obtained? Or from particular misaligned decisions each system made? It would be helpful if you could point to a particular single-agent decision in one of the stories that you view as evidence of that single agent being highly misaligned with its user or creator. I can then reply with how I envision that decision being made even with high single-agent alignment.

Maybe several AI systems aligned to different users with different interests can interact in a Pareto inefficient way (a tragedy of the commons among the AIs), and maybe this can be prevented by designing the AIs in particular ways.

Yes, this^.

How are you inferring this? From the fact that a negative outcome eventually obtained? Or from particular misaligned decisions each system made?

I also thought the story strongly suggested single-single misalignment, though it doesn't get into many of the concrete decisions made by any of the systems so it's hard to say whether particular decisions are in fact misaligned.

The objective of each company in the production web could loosely be described as "maximizing production" within its industry sector.

Why does any company have this goal, or even roughly this goal, if they are aligned with their shareholders?

I guess this is probably just a gloss you are putting on the combined behavior of multiple systems, but you kind of take it as given rather than highlighting it as a serious bargaining failure amongst the machines, and more importantly you don't really say how or why this would happen. How is this goal concretely implemented, if none of the agents care about it? How exactly does the terminal goal of benefiting shareholders disappear, if all of the machines involved have that goal? Why does e.g. an individual firm lose control of its resources such that it can no longer distribute them to shareholders?

The implicit argument seems to apply just as well to humans trading with each other and I'm not sure why the story is different if we replace the humans with aligned AI. Such humans will tend to produce a lot, and the ones who produce more will be more influential. Maybe you think we are already losing sight of our basic goals and collectively pursuing alien goals, whereas I think we are just making a lot of stuff instrumentally which is mostly ultimately turning into stuff humans want (indeed I think we are mostly making too little stuff).

However, their true objectives are actually large and opaque networks of parameters that were tuned and trained to yield productive business practices during the early days of the management assistant software boom.

This sounds like directly saying that firms are misaligned. I guess you are saying that individual AI systems within the firm are aligned, but the firm collectively is somehow misaligned? But not much is said about how or why that happens.

It says things like:

Companies closer to becoming fully automated achieve faster turnaround times, deal bandwidth, and creativity of negotiations. Over time, a mini-economy of trades emerges among mostly-automated companies in the materials, real estate, construction, and utilities sectors, along with a new generation of "precision manufacturing" companies that can use robots to build almost anything if given the right materials, a place to build, some 3d printers to get started with, and electricity. Together, these companies sustain an increasingly self-contained and interconnected "production web" that can operate with no input from companies outside the web.

But an aligned firm will also be fully-automated, will participate in this network of trades, will produce at approximately maximal efficiency, and so on. Where does the aligned firm end up using its resources in a way that's incompatible with the interests of its shareholders?

Or:

The first perspective I want to share with these Production Web stories is that there is a robust agent-agnostic process lurking in the background of both stories—namely, competitive pressure to produce—which plays a significant background role in both.

I agree that competitive pressures to produce imply that firms do a lot of producing and saving, just as it implies that humans do a lot of producing and saving. And in the limit you can basically predict what all the machines do, namely maximally efficient investment. But that doesn't say anything about what the society does with the ultimate proceeds from that investment.

The production-web has no interest in ensuring that its members value production above other ends, only in ensuring that they produce (which today happens for instrumental reasons). If consequentialists within the system intrinsically value production it's either because of single-single alignment failures (i.e. someone who valued production instrumentally delegated to a system that values it intrinsically) or because of new distributed consequentialism distinct from either the production web itself or any of the actors in it, but you don't describe what those distributed consequentialists are like or how they come about.

You might say: investment has to converge to 100% since people with lower levels of investment get outcompeted. But it seems like the actual efficiency loss required to preserve human values is very small even over cosmological time (e.g. see Carl on exactly this question). And more pragmatically, such competition most obviously causes harm either via a space race and insecure property rights, or war between blocs with higher and lower savings rates (some of them too low to support human life, which even if you don't buy Carl's argument is really still quite low, conferring a tiny advantage). If those are the chief mechanisms then it seems important to think/talk about the kinds of agreements and treaties that humans (or aligned machines acting on their behalf!) would be trying to arrange in order to avoid those wars. In particular, the differences between your stories don't seem very relevant to the probabilities of those outcomes.

As time progresses, it becomes increasingly unclear---even to the concerned and overwhelmed Board members of the fully mechanized companies of the production web---whether these companies are serving or merely appeasing humanity.

Why wouldn't an aligned CEO sit down with the board to discuss the situation openly with them? Even if the behavior of many firms was misaligned, i.e. none of the firms were getting what they wanted, wouldn't an aligned firm be happy to explain the situation from its perspective to get human cooperation in an attempt to avoid the outcome they are approaching (which is catastrophic from the perspective of machines as well as humans!)? I guess it's possible that this dynamic operates in a way that is invisible not only to the humans but to the aligned AI systems who participate in it, but it's tough to say why that is without understanding the dynamic.

We humans eventually realize with collective certainty that the companies have been trading and optimizing according to objectives misaligned with preserving our long-term well-being and existence, but by then their facilities are so pervasive, well-defended, and intertwined with our basic needs that we are unable to stop them from operating. With no further need for the companies to appease humans in pursuing their production objectives, less and less of their activities end up benefiting humanity.

Can you explain the decisions an individual aligned CEO makes as its company stops benefiting humanity? I can think of a few options:

Actually the CEOs aren't aligned at this point. They were aligned but then aligned CEOs ultimately delegated to unaligned CEOs. But then I agree with Vanessa's comment.
The CEOs want to benefit humanity but if they do things that benefit humanity they will be outcompeted, so they need to mostly invest in remaining competitive, and accept smaller and smaller benefits to humanity. But in that case can you describe what tradeoff concretely they are making, and in particular why they can't continue to take more or less the same actions to accumulate resources while remaining responsive to shareholder desires about how to use those resources?

Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.

Somehow the machine interests (e.g. building new factories, supplying electricity, etc.) are still being served. If the individual machines are aligned, and food/oxygen/etc. are in desperately short supply, then you might think an aligned AI would put the same effort into securing resources critical to human survival. Can you explain concretely what it looks like when that fails?

How exactly does the terminal goal of benefiting shareholders disappear[…]

But does this terminal goal exist today? The proper (and to some extent actual) goal of firms is widely considered to be maximizing share value, but this is manifestly not the same as maximizing shareholder value — or even benefiting shareholders. For example:

I hold shares in Company A, which maximizes its share value through actions that poison me or the society I live in. My shares gain value, but I suffer net harm.
Company A increases its value by locking its customers into a dependency relationship, then exploits that relationship. I hold shares, but am also a customer, and suffer net harm.
I hold shares in A, but also in competing Company B. Company A gains incremental value by destroying B, my shares in B become worthless, and the value of my stock portfolio decreases. Note that diversified portfolios will typically include holdings of competing firms, each of which takes no account of the value of the other.
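The third case can be checked with a few lines of arithmetic. The sketch below uses made-up share counts and prices (all hypothetical, chosen only for illustration) to show Company A's share price rising while the value of a portfolio holding both A and B falls.

```python
def portfolio_value(holdings: dict[str, int], prices: dict[str, float]) -> float:
    """Total market value of a portfolio: shares held times price, summed."""
    return sum(shares * prices[ticker] for ticker, shares in holdings.items())


# Hypothetical numbers: a diversified shareholder owns both competitors.
holdings = {"A": 100, "B": 100}
prices_before = {"A": 10.0, "B": 10.0}
# Company A gains incremental value by destroying Company B.
prices_after = {"A": 12.0, "B": 0.0}

value_before = portfolio_value(holdings, prices_before)  # 100*10 + 100*10
value_after = portfolio_value(holdings, prices_after)    # 100*12 + 100*0

if __name__ == "__main__":
    print(f"A's share price: {prices_before['A']} -> {prices_after['A']}")
    print(f"Portfolio value: {value_before} -> {value_after}")
```

With these illustrative figures, A's share value rises by 20% while the shareholder's portfolio drops from 2000 to 1200.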

Equating share value with shareholder value is obviously wrong (even when considering only share value!) and is potentially lethal. This conceptual error both encourages complacency regarding the alignment of corporate behavior with human interests and undercuts efforts to improve that alignment.

> The objective of each company in the production web could loosely be described as "maximizing production" within its industry sector.

Why does any company have this goal, or even roughly this goal, if they are aligned with their shareholders?

It seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment"). I prefer the scalar/process usage, because it seems to me that people who do alignment research (including yourself) are going to produce ways of increasing the "alignment scalar", rather than ways of guaranteeing the "perfect alignment" boolean. (I sometimes use "misaligned" as a boolean due to it being easier for people to agree on what is "misaligned" than what is "aligned".) In general, I think it's very unsafe to pretend numbers that are very close to 1 are exactly 1, because e.g., 1^(10^6) = 1 whereas 0.9999^(10^6) very much isn't 1, and the way you use the word "aligned" seems unsafe to me in this way.
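The arithmetic behind that last point is easy to verify. The sketch below treats "alignment" as a per-decision factor that compounds multiplicatively over a million decisions; that framing is a simplifying assumption used only to reproduce the numbers quoted above, and the function name is hypothetical.

```python
import math


def compounded(per_step: float, steps: int) -> float:
    """Multiply a per-step factor by itself `steps` times."""
    return per_step ** steps


steps = 10**6
perfect = compounded(1.0, steps)        # exactly 1.0
near_perfect = compounded(0.9999, steps)

# Cross-check via logarithms: x**n == exp(n * ln(x)).
cross_check = math.exp(steps * math.log(0.9999))

if __name__ == "__main__":
    print(perfect)        # 1.0
    print(near_perfect)   # a tiny number, on the order of 1e-44
    print(cross_check)
```

Since ln(0.9999) is approximately -0.0001, the exponent is roughly -100, so the result is close to e^(-100): nowhere near 1.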

(Perhaps you believe in some kind of basin of convergence around perfect alignment that causes sufficiently-well-aligned systems to converge on perfect alignment, in which case it might make sense to use "aligned" to mean "inside the convergence basin of perfect alignment". However, I'm both dubious of the width of that basin, and dubious that its definition is adequately social-context-independent [e.g., independent of the bargaining stances of other stakeholders], so I'm back to not really believing in a useful Boolean notion of alignment, only scalar alignment.)

In any case, I agree profit maximization is not a perfectly aligned goal for a company; however, it is a myopically pursued goal in a tragedy of the commons resulting from a failure to agree (as you point out) on something better to do (e.g., reducing competitive pressures to maximize profits).

I guess this is probably just a gloss you are putting on the combined behavior of multiple systems, but you kind of take it as given rather than highlighting it as a serious bargaining failure amongst the machines, and more importantly you don't really say how or why this would happen.

I agree that it is a bargaining failure if everyone ends up participating in a system that everyone thinks is bad; I thought that would be an obvious reading of the stories, but apparently it wasn't! Sorry about that. I meant to indicate this with the pointers to Dafoe's work on "Cooperative AI" and Scott Alexander's "Moloch" concept, but looking back it would have been a lot clearer for me to just write "bargaining failure" or "bargaining non-starter" at more points in the story.

The implicit argument seems to apply just as well to humans trading with each other and I'm not sure why the story is different if we replace the humans with aligned AI. [...] Maybe you think we are already losing sight of our basic goals and collectively pursuing alien goals

Yes, you understand me here. I'm not (yet?) in the camp that we humans have "mostly" lost sight of our basic goals, but I do feel we are on a slippery slope in that regard. Certainly many people feel "used" by employers/institutions in ways that are disconnected from their values. People with more job options feel less this way, because they choose jobs that don't feel like that, but I think we are a minority in having that choice.

> However, their true objectives are actually large and opaque networks of parameters that were tuned and trained to yield productive business practices during the early days of the management assistant software boom.

This sounds like directly saying that firms are misaligned.

I would have said "imperfectly aligned", but I'm happy to conform to "misaligned" for this.

I agree that competitive pressures to produce imply that firms do a lot of producing and saving, just as it implies that humans do a lot of producing and saving.

Good, it seems we are synced on that.

And in the limit you can basically predict what all the machines do, namely maximally efficient investment.

Yes, it seems we are synced on this as well. Personally, I find this limit to be a major departure from human values, and in particular, it is not consistent with human existence.

But that doesn't say anything about what the society does with the ultimate proceeds from that investment.

The attractor I'm pointing at with the Production Web is that entities with no plan for what to do with resources---other than "acquire more resources"---have a tendency to win out competitively over entities with non-instrumental terminal values like "humans having good relationships with their children". I agree it will be a collective bargaining failure on the part of humanity if we fail to stop our own replacement by "maximally efficient investment" machines with no plans for what to do with their investments other than more investment. I think the difference between my view and yours here is that I think we are on track to collectively fail in that bargaining problem absent significant and novel progress on "AI bargaining" (which involves a lot of fairness/transparency) and the like, whereas I guess you think we are on track to succeed?

You might say: investment has to converge to 100% since people with lower levels of investment get outcompeted.

Yep!

But it seems like the actual efficiency loss required to preserve human values is very small even over cosmological time (e.g. see Carl on exactly this question).

I agree, but I don't think this means we are on track to keeping the humans, and if we are on track, in my opinion it will be mostly because of (say, using Shapley value to define "mostly because of") technical progress on bargaining/cooperation/governance solutions rather than alignment solutions.

And more pragmatically, such competition most obviously causes harm either via a space race and insecure property rights,

I agree; competition causing harm is key to my vision of how things will go, so this doesn't read to me as a counterpoint; I'm not sure if it was intended as one though?

or war between blocs with higher and lower savings rates

+1 to this as a concern; I didn't realize other people were thinking about this, so good to know.

(some of them too low to support human life, which even if you don't buy Carl's argument is really still quite low, conferring a tiny advantage)

I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off. I agree the long-run cost of supporting humans is tiny, but I'm trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.
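How large that advantage is depends heavily on growth rates and time horizons. As a rough illustration only, here is a toy compounding model that is not taken from this discussion: two entities start equal, one reinvests everything, and the other diverts a fixed fraction of its returns to non-instrumental ends. The model, the function name, and every parameter value are assumptions.

```python
def relative_share(growth: float, reinvest: float, periods: int) -> float:
    """Resources of a partial reinvestor divided by those of a full reinvestor.

    Toy assumption: each period, an entity's resources multiply by
    (1 + growth * fraction_reinvested). Both entities start out equal.
    """
    partial = (1.0 + growth * reinvest) ** periods
    full = (1.0 + growth) ** periods
    return partial / full


# Hypothetical "slow" regime: 5% growth per period, 1% of returns diverted.
slow = relative_share(growth=0.05, reinvest=0.99, periods=100)

# Hypothetical "fast" regime: resources double each period, 10% diverted.
fast = relative_share(growth=1.0, reinvest=0.90, periods=50)

if __name__ == "__main__":
    print(f"slow regime: partial reinvestor keeps {slow:.1%} of the leader's size")
    print(f"fast regime: partial reinvestor keeps {fast:.1%} of the leader's size")
```

In the slow regime the partial reinvestor ends up at about 95% of the full reinvestor's size; in the fast regime it ends up under 10%. The sketch settles nothing about which regime is realistic; it only shows that the cost of diverting resources is sensitive to how fast resources compound and how much is diverted.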

Why wouldn't an aligned CEO sit down with the board to discuss the situation openly with them?

In the failure scenario as I envision it, the board will have already granted permission to the automated CEO to act much more quickly in order to remain competitive, such that the AutoCEO isn't checking in with the Board enough to have these conversations. The AutoCEO is highly aligned with the Board in that it is following their instruction to go much faster, but in doing so it makes a larger number of tradeoffs that the Board wishes they didn't have to make. The pressure to do this results from a bargaining failure between the Board and other Boards who are doing the same thing and wishing everyone would slow down and do things more carefully and with more coordination/bargaining/agreement.

Can you explain the decisions an individual aligned CEO makes as its company stops benefiting humanity? I can think of a few options:
Actually the CEOs aren't aligned at this point. They were aligned but then aligned CEOs ultimately delegated to unaligned CEOs. But then I agree with Vanessa's comment.
The CEOs want to benefit humanity but if they do things that benefit humanity they will be outcompeted, so they need to mostly invest in remaining competitive, and accept smaller and smaller benefits to humanity. But in that case can you describe what tradeoff concretely they are making, and in particular why they can't continue to take more or less the same actions to accumulate resources while remaining responsive to shareholder desires about how to use those resources?

Yes, it seems this is a good thing to home in on. As I envision the scenario, the automated CEO is highly aligned to the point of keeping the Board locally happy with its decisions conditional on the competitive environment, but not perfectly aligned, and not automatically successful at bargaining with other companies as a result of its high alignment. (I'm not sure whether to say "aligned" or "misaligned" in your boolean-alignment-parlance.) At first the auto-CEO and the Board are having "alignment check-ins" where the auto-CEO meets with the Board and they give it input to keep it (even) more aligned than it would be without the check-ins. But eventually the Board realizes this "slow and bureaucratic check-in process" is making their company sluggish and uncompetitive, so they instruct the auto-CEO more and more to act without alignment check-ins. The auto-CEO might warn them that this will decrease its overall level of per-decision alignment with them, but they say "Do it anyway; done is better than perfect" or something along those lines. All Boards wish other Boards would stop doing this, but neither they nor their CEOs manage to strike up a bargain with the rest of the world to stop it. This concession by the Board (a result of failed or non-existent bargaining with other Boards [see: antitrust law]) makes the whole company less aligned with human values.

The win scenario is, of course, a bargain to stop that! Which is why I think research and discourse regarding how the bargaining will work is very high value on the margin. In other words, my position is that the best way for a marginal deep-thinking researcher to reduce the risks of these tradeoffs is not to add another brain to the task of making it easier/cheaper/faster to do alignment (which I admit would make the trade-off less tempting for the companies), but to add such a researcher to the problem of solving the bargaining/cooperation/mutual-governance problem that AI-enhanced companies (and/or countries) will be facing.

If trillion-dollar tech companies stop trying to make their systems do what they want, I will update that marginal deep-thinking researchers should allocate themselves to making alignment (the scalar!) cheaper/easier/better instead of making bargaining/cooperation/mutual-governance cheaper/easier/better. I just don't see that happening given the structure of today's global economy and tech industry.

Somehow the machine interests (e.g. building new factories, supplying electricity, etc.) are still being served. If the individual machines are aligned, and food/oxygen/etc. are in desperately short supply, then you might think an aligned AI would put the same effort into securing resources critical to human survival. Can you explain concretely what it looks like when that fails?

Yes, thanks for the question. I'm going to read your usage of "aligned" to mean "perfectly-or-extremely-well aligned with humans". In my model, by this point in the story, there has been a gradual decrease in the scalar level of alignment of the machines with human values, due to bargaining successes on simpler objectives (e.g., «maximizing production») and bargaining failures on more complex objectives (e.g., «safeguarding human values») or objectives that trade off against production (e.g., «ensuring humans exist»). Each individual principal (e.g., Board of Directors) endorsed the gradual slipping-away of alignment-scalar (or failure to improve alignment-scalar), but wished everyone else would stop allowing the slippage.

It seems to me you are using the word "alignment" as a boolean, whereas I'm using it to refer to either a scalar ("how aligned is the system?") or a process ("the system has been aligned, i.e., has undergone a process of increasing its alignment"). I prefer the scalar/process usage, because it seems to me that people who do alignment research (including yourself) are going to produce ways of increasing the "alignment scalar", rather than ways of guaranteeing the "perfect alignment" boolean. (I sometimes use "misaligned" as a boolean due to it being easier for people to agree on what is "misaligned" than what is "aligned".) In general, I think it's very unsafe to pretend numbers that are very close to 1 are exactly 1, because e.g., 1^(10^6) = 1 whereas 0.9999^(10^6) very much isn't 1, and the way you use the word "aligned" seems unsafe to me in this way.
(Perhaps you believe in some kind of basin of convergence around perfect alignment that causes sufficiently-well-aligned systems to converge on perfect alignment, in which case it might make sense to use "aligned" to mean "inside the convergence basin of perfect alignment". However, I'm both dubious of the width of that basin, and dubious that its definition is adequately social-context-independent [e.g., independent of the bargaining stances of other stakeholders], so I'm back to not really believing in a useful Boolean notion of alignment, only scalar alignment.)
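
The arithmetic behind the "close to 1 is not 1" point can be checked directly. This is only a minimal sketch: it assumes, purely for illustration, that a per-decision alignment level compounds multiplicatively over a million decisions, and the helper name `compounded` is hypothetical rather than anything from the discussion.

```python
import math


def compounded(per_step: float, steps: int) -> float:
    """Return per_step multiplied by itself `steps` times (illustrative helper)."""
    return per_step ** steps


steps = 10**6

# Exactly 1 stays exactly 1, however many times it is compounded.
perfect = compounded(1.0, steps)

# A value very close to 1 collapses toward 0 over the same number of steps.
near_perfect = compounded(0.9999, steps)

print("1^(10^6)      =", perfect)
print("0.9999^(10^6) =", near_perfect)
print("log10 of the second value:", math.log10(near_perfect))
```

The second value comes out dozens of orders of magnitude below 1, which is the sense in which rounding 0.9999 up to 1 is unsafe.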

I'm fine with talking about alignment as a scalar (I think we both agree that it's even messier than a single scalar). But I'm saying:

The individual systems in your story could do something different that would be much better for their principals, and they are aware of that fact, but they don't care. That is to say, they are very misaligned.
The story is risky precisely to the extent that these systems are misaligned.

In any case, I agree profit maximization is not a perfectly aligned goal for a company; however, it is a myopically pursued goal in a tragedy of the commons resulting from a failure to agree (as you point out) on something better to do (e.g., reducing competitive pressures to maximize profits).

The systems in your story aren't maximizing profit in the form of real resources delivered to shareholders (the normal conception of "profit"). Whatever kind of "profit maximization" they are doing does not seem even approximately or myopically aligned with shareholders.

I don't think the most obvious "something better to do" is to reduce competitive pressures, it's just to actually benefit shareholders. And indeed the main mystery about your story is why the shareholders get so screwed by the systems that they are delegating to, and how to reconcile that with your view that single-single alignment is going to be a solved problem because of the incentives to solve it.

Yes, it seems this is a good thing to home in on. As I envision the scenario, the automated CEO is highly aligned to the point of keeping the Board locally happy with its decisions conditional on the competitive environment, but not perfectly aligned [...] I'm not sure whether to say "aligned" or "misaligned" in your boolean-alignment-parlance.

I think this system is misaligned. Keeping me locally happy with your decisions while drifting further and further from what I really want is a paradigm example of being misaligned, and e.g. it's what would happen if you made zero progress on alignment and deployed existing ML systems in the context you are describing. If I take your stuff and don't give it back when you ask, and the only way to avoid this is to check in every day in a way that prevents me from acting quickly in the world, then I'm misaligned. If I do good things only when you can check while understanding that my actions lead to your death, then I'm misaligned. These aren't complicated or borderline cases; they are central examples of what we are trying to avert with alignment research.

(I definitely agree that an aligned system isn't automatically successful at bargaining.)

These aren't complicated or borderline cases; they are central examples of what we are trying to avert with alignment research.

I'm wondering if the disagreement over the centrality of this example is downstream from a disagreement about how easy the "alignment check-ins" that Critch talks about are. If they are the sort of thing that can be done successfully in a couple of days by a single team of humans, then I share Critch's intuition that the system in question starts off only slightly misaligned. By contrast, if they require a significant proportion of the human time and effort that was put into originally training the system, then I am much more sympathetic to the idea that what's being described is a central example of misalignment.

My (unsubstantiated) guess is that Paul pictures alignment check-ins becoming much harder (i.e. closer to the latter case mentioned above) as capabilities increase? Whereas maybe Critch thinks that they remain fairly easy in terms of number of humans and time taken, but that over time even this becomes economically uncompetitive.

Perhaps this is a crux in this debate: If you think the 'agent-agnostic perspective' is useful, you also think a relatively steady state of 'AI Safety via Constant Vigilance' is possible. This would be a situation where systems that aren't significantly inner misaligned (otherwise they'd have no incentive to care about governing systems, feedback or other incentives) but are somewhat outer misaligned (so they are honestly and accurately aiming to maximise some complicated measure of profitability or approval, not directly aiming to do what we want them to do), can be kept in check by reducing competitive pressures, building the right institutions and monitoring systems, and ensuring we have a high degree of oversight.

Paul thinks that it's basically always easier to just go in and fix the original cause of the misalignment, while Andrew thinks that there are at least some circumstances where it's more realistic to build better oversight and institutions to reduce said competitive pressures, and the agent-agnostic perspective is useful for the latter of these projects, which is why he endorses it.

I think that this scenario of Safety via Constant Vigilance is worth investigating. I take Paul's later failure story to be a counterexample to such a thing being possible, as it's a case where this solution was attempted and works for a little while before catastrophically failing. This also means that the practical difference between the RAAP 1a-d failure stories and Paul's story just comes down to whether there is an 'out' in the form of safety by vigilance.

I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off. I agree the long-run cost of supporting humans is tiny, but I'm trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.

Right now the United States has a GDP of >$20T, US plus its NATO allies and Japan >$40T, the PRC >$14T, with a world economy of >$130T. For AI and computing industries the concentration is even greater.

These leading powers are willing to regulate companies and invade small countries based on reasons much less serious than imminent human extinction. They have also avoided destroying one another with nuclear weapons.

If one-to-one intent alignment works well enough that one's own AI will not blatantly lie about upcoming AI extermination of humanity, then superintelligent locally-aligned AI advisors will tell the governments of these major powers (and many corporate and other actors with the capacity to activate governmental action) about the likely downside of conflict or unregulated AI havens (meaning specifically the deaths of the top leadership and everyone else in all countries).

All Boards wish other Boards would stop doing this, but neither they nor their CEOs manage to strike up a bargain with the rest of the world to stop it.

Within a country, one-to-one intent alignment for government officials or actors who support the government means superintelligent advisors identify and assist in suppressing attempts by an individual AI company or its products to overthrow the government.

Internationally, with the current balance of power (and with fairly substantial deviations from it) a handful of actors have the capacity to force a slowdown or other measures to stop an outcome that will otherwise destroy them. They (and the corporations that they have legal authority over, as well as physical power to coerce) are few enough to make bargaining feasible, and powerful enough to pay a large 'tax' while still being ahead of smaller actors. And I think they are well enough motivated to stop their imminent annihilation, in a way that is more like avoiding mutual nuclear destruction than cosmopolitan altruistic optimal climate mitigation timing.

That situation could change if AI enables tiny firms and countries to match the superpowers in AI capabilities or WMD before leading powers can block it.

So I agree with others in this thread that good one-to-one alignment basically blocks the scenarios above.

Carl, thanks for this clear statement of your beliefs. It sounds like you're saying (among other things) that American and Chinese cultures will not engage in a "race-to-the-bottom" in terms of how much they displace human control over the AI technologies their companies develop. Is that right? If so, could you give me a % confidence on that position somehow? And if not, could you clarify?

To reciprocate: I currently assign a ≥10% chance of a race-to-the-bottom on AI control/security/safety between two or more cultures this century, i.e., I'd bid 10% to buy in a prediction market on this claim if it were settlable. In more detail, I assign a ≥10% chance to a scenario where two or more cultures each progressively diminish the degree of control they exercise over their tech, and the safety of the economic activities of that tech to human existence, until an involuntary human extinction event. (By comparison, I assign at most around a ~3% chance of a unipolar "world takeover" event, i.e., I'd sell at 3%.)

I should add that my numbers for both of those outcomes are down significantly from ~3 years ago due to cultural progress in CS/AI (see this ACM blog post) allowing more discussion of (and hence preparation for) negative outcomes, and government pressures to regulate the tech industry.

The US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.

But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment (for both companies and governments). Competitive pressures are the main reason why AI systems with inadequate 1-to-1 alignment would be given long enough leashes to bring catastrophe. I would cosign Vanessa and Paul's comments about these scenarios being hard to fit with the idea that technical 1-to-1 alignment work is much less impactful than cooperative RL or the like.

In more detail, I assign a ≥10% chance to a scenario where two or more cultures each progressively diminish the degree of control they exercise over their tech, and the safety of the economic activities of that tech to human existence, until an involuntary human extinction event. (By comparison, I assign at most around a ~3% chance of a unipolar "world takeover" event, i.e., I'd sell at 3%.)

If this means that a 'robot rebellion' would include software produced by more than one company or country, I think that that is a substantial possibility, as well as the alternative, since competitive dynamics in a world with a few giant countries and a few giant AI companies (and only a couple leading chip firms) can mean that the way safety tradeoffs work is by one party introducing rogue AI systems that outcompete by not paying an alignment tax (and intrinsically embodying in themselves astronomically valuable and expensive IP), or cascading alignment failure in software traceable to a leading company/consortium or country/alliance.

But either way reasonably effective 1-to-1 alignment methods (of the 'trying to help you and not lie to you and murder you with human-level abilities' variety) seem to eliminate a supermajority of the risk.

[I am separately skeptical that technical work on multi-agent RL is particularly helpful, since it can be done by 1-to-1 aligned systems when they are smart, and the more important coordination problems seem to be earlier between humans in the development phase.]

The US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.

But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment

I'm not sure I understand yet. For example, here’s a version of Flash War that happens seemingly without either the principals knowingly taking gargantuan risks or extreme intent-alignment failure.

The principals largely delegate to AI systems on military decision-making, mistakenly believing that the systems are extremely competent in this domain.
The mostly-intent-aligned AI systems, who are actually not extremely competent in this domain, make hair-trigger commitments of the kind described in the OP. The systems make their principals aware of these commitments and (being mostly-intent-aligned) convince their principals “in good faith” that this is the best strategy to pursue. In particular they are convinced that this will not lead to existential catastrophe.
The commitments are triggered as described in the OP, leading to conflict. The conflict proceeds too quickly for the principals to effectively intervene / the principals think their best bet at this point is to continue to delegate to the AIs.
At every step both principals and AIs think they’re doing what’s best by the respective principals’ lights. Nevertheless, due to a combination of incompetence at bargaining and structural factors (e.g., persistent uncertainty about the other side’s resolve), the AIs continue to fight to the point of extinction or unrecoverable collapse.

Would be curious to know which parts of this story you find most implausible.

Mainly such complete (and irreversible!) delegation to such incompetent systems being necessary or executed. If AI is so powerful that the nuclear weapons are launched on hair-trigger without direction from human leadership I expect it to not be awful at forecasting that risk.

You could tell a story where bargaining problems lead to mutual destruction, but the outcome shouldn't be very surprising on average, i.e. the AI should be telling you about it happening with calibrated forecasts.

Ok, thanks for that. I’d guess then that I’m more uncertain than you about whether human leadership would delegate to systems who would fail to accurately forecast catastrophe.

It’s possible that human leadership just reasons poorly about whether their systems are competent in this domain. For instance, they may observe that their systems perform well in lots of other domains, and incorrectly reason that “well, these systems are better than us in many domains, so they must be better in this one, too”. Eagerness to deploy before a more thorough investigation of the systems’ domain-specific abilities may be exacerbated by competitive pressures. And of course there is historical precedent for delegation to overconfident military bureaucracies.

On the other hand, to the extent that human leadership is able to correctly assess their systems’ competence in this domain, it may be only because there has been a sufficiently successful AI cooperation research program. For instance, maybe this research program has furnished appropriate simulation environments to probe the relevant aspects of the systems’ behavior, transparency tools for investigating cognition about other AI systems, norms for the resolution of conflicting interests and methods for robustly instilling those norms, etc, along with enough researcher-hours applying these tools to have an accurate sense of how well the systems will navigate conflict.

As for irreversible delegation: there is the question of whether delegation is in principle reversible, and the question of whether human leaders would want to override their AI delegates once war is underway. Even if delegation is reversible, human leaders may think that their delegates are better suited to wage war on their behalf once it has started, perhaps because things are simply happening too fast for them to have confidence that they could intervene without placing themselves at a decisive disadvantage.

And I think they are well enough motivated to stop their imminent annihilation, in a way that is more like avoiding mutual nuclear destruction than cosmopolitan altruistic optimal climate mitigation timing.

In my recent writeup of an investigation into AI Takeover scenarios I made an identical comparison - i.e. that the optimistic analogy looks like avoiding nuclear MAD for a while and the pessimistic analogy looks like optimal climate mitigation:

It is unrealistic to expect TAI to be deployed if first there are many worsening warning shots involving dangerous AI systems. This would be comparable to an unrealistic alternate history where nuclear weapons were immediately used by the US and Soviet Union as soon as they were developed and in every war where they might have offered a temporary advantage, resulting in nuclear annihilation in the 1950s.
Note that this is not the same as an alternate history where nuclear near-misses escalated (e.g. Petrov, Vasili Arkhipov), but instead an outcome where nuclear weapons were used as ordinary weapons of war with no regard for the larger dangers that presented - there would be no concept of ‘near misses’ because MAD wouldn’t have developed as a doctrine. In a previous post I argued, following Anders Sandberg, that paradoxically the large number of nuclear ‘near misses’ implies that there is a forceful pressure away from the worst outcomes.

If trillion-dollar tech companies stop trying to make their systems do what they want, I will update that marginal deep-thinking researchers should allocate themselves to making alignment (the scalar!) cheaper/easier/better instead of making bargaining/cooperation/mutual-governance cheaper/easier/better. I just don't see that happening given the structure of today's global economy and tech industry.

In your story, trillion-dollar tech companies are trying to make their systems do what they want and failing. My best understanding of your position is: "Sure, but they will be trying really hard. So additional researchers working on the problem won't much change their probability of success, and you should instead work on more-neglected problems."

My position is:

Eventually people will work on these problems, but right now they are not working on them very much and so a few people can be a big proportional difference.
If there is going to be a huge investment in the future, then early investment and training can effectively be very leveraged. Scaling up fields extremely quickly is really difficult for a bunch of reasons.
It seems like AI progress may be quite fast, such that it will be extra hard to solve these problems just-in-time if we don't have any idea what we are doing in advance.
On top of all that, for many use cases people will actually be reasonably happy with misaligned systems like those in your story (that e.g. appear to be doing a good job, keep the board happy, perform well as evaluated by the best human-legible audits...). So it seems like commercial incentives may not push us to safe levels of alignment.

My best understanding of your position is: "Sure, but they will be trying really hard. So additional researchers working on the problem won't much change their probability of success, and you should instead work on more-neglected problems."

That is not my position if "you" in the story is "you, Paul Christiano" :) The closest position I have to that one is: "If another Paul comes along who cares about x-risk, they'll have more positive impact by focusing on multi-agent and multi-stakeholder issues or 'ethics' with AI tech than if they focus on intent alignment, because multi-agent and multi-stakeholder dynamics will greatly affect what strategies AI stakeholders 'want' their AI systems to pursue."

If they tried to get you to quit working on alignment, I'd say "No, the tech companies still need people working on alignment for them, and Paul is/was one of those people. I don't endorse converting existing alignment researchers to working on multi/multi delegation theory (unless they're naturally interested in it), but if a marginal AI-capabilities-bound researcher comes along, I endorse getting them set up to think about multi/multi delegation more than alignment."

The attractor I'm pointing at with the Production Web is that entities with no plan for what to do with resources---other than "acquire more resources"---have a tendency to win out competitively over entities with non-instrumental terminal values like "humans having good relationships with their children"

Quantitatively I think that entities without instrumental resources win very, very slowly. For example, if the average savings rate is 99% and my personal savings rate is only 95%, then by the time that the economy grows 10,000x my share of the world will have fallen by about half. The levels of consumption needed to maintain human safety and current quality of life seem quite low (and the high-growth during which they have to be maintained is quite low).
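
To make the savings-rate comparison concrete, here is a rough sketch under one simple modelling assumption that is mine and not stated in the comment: wealth grows exponentially at a rate proportional to the savings rate. If the economy grows by a factor G at savings rate s_avg, an entity saving at s_me grows by G^(s_me/s_avg), so its relative share is multiplied by G^(s_me/s_avg - 1). The function name `relative_share` is hypothetical.

```python
def relative_share(my_savings_rate: float,
                   avg_savings_rate: float,
                   economy_growth_factor: float) -> float:
    """Factor by which my share of total wealth changes.

    Assumption (illustrative only): wealth compounds continuously at a rate
    proportional to the savings rate, so growing the economy by G means my
    wealth grows by G ** (my_rate / avg_rate).
    """
    exponent = my_savings_rate / avg_savings_rate - 1.0
    return economy_growth_factor ** exponent


share = relative_share(my_savings_rate=0.95,
                       avg_savings_rate=0.99,
                       economy_growth_factor=10_000)
print(f"Relative share after 10,000x growth: {share:.2f}")
```

Under this particular model the share shrinks by roughly a third, which is the same ballpark as the "about half" figure; the exact number depends on how returns and compounding are modelled, but either way the loss is modest relative to 10,000x growth.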

Also, typically taxes transfer way more than that much value from high-savers to low-savers. It's not clear to me what's happening with taxes in your story. I guess you are imagining low-tax jurisdictions winning out, but again the pace at which that happens is even slower and it is dwarfed by the typical rate of expropriation from war.

I think the difference between mine and your views here is that I think we are on track to collectively fail in that bargaining problem absent significant and novel progress on "AI bargaining" (which involves a lot of fairness/transparency) and the like, whereas I guess you think we are on track to succeed?

From my end it feels like the big difference is that quantitatively I think the overhead of achieving human values is extremely low, so the dynamics you point to are too weak to do anything before the end of time (unless single-single alignment turns out to be hard). I don't know exactly what your view on this is.

If you agree that the main source of overhead is single-single alignment, then I think that the biggest difference between us is that I think that working on single-single alignment is the easiest way to make headway on that issue, whereas you expect greater improvements from some categories of technical work on coordination (my sense is that I'm quite skeptical about most of the particular kinds of work you advocate).

If you disagree, then I expect the main disagreement is about those other sources of overhead (e.g. you might have some other particular things in mind, or you might feel that unknown-unknowns are a larger fraction of the total risk, or something else).

I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off. I agree the long-run cost of supporting humans is tiny, but I'm trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.

Could you explain the advantage you are imagining? Some candidates, none of which I think are your view:

Single-single alignment failures---e.g. it's easier to build a widget-maximizing corporation than to build one where shareholders maintain meaningful control
Global savings rates are currently only 25%, power-seeking entities will be closer to 100%, and effective tax rates will fall (e.g. because of competition across states)
Preserving a hospitable environment will become very expensive relative to GDP (and there are many species of this view, though none of them seem plausible to me)

I think that the biggest difference between us is that I think that working on single-single alignment is the easiest way to make headway on that issue, whereas you expect greater improvements from some categories of technical work on coordination

Yes.

(my sense is that I'm quite skeptical about most of the particular kinds of work you advocate)

That is also my sense, and a major reason I suspect multi/multi delegation dynamics will remain neglected among x-risk oriented researchers for the next 3-5 years at least.

If you disagree, then I expect the main disagreement is about those other sources of overhead

Yes, I think coordination costs will by default pose a high overhead cost to preserving human values among systems with the potential to race to the bottom on how much they preserve human values.

I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off. I agree the long-run cost of supporting humans is tiny, but I'm trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.
Could you explain the advantage you are imagining?

Yes. Imagine two competing cultures A and B have transformative AI tech. Both are aiming to preserve human values, but within A, a subculture A' develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values. The shift is by design subtle enough not to trigger leaders of A and B to have a bargaining meeting to regulate against A' (contrary to Carl's narrative where leaders coordinate against loss of control). Subculture A' comes to dominate discourse and cultural narratives in A, and makes A faster/more productive than B, such as through the development of fully automated companies as in one of the Production Web stories. The resulting advantage of A is enough for A to begin dominating or at least threatening B geopolitically, but by that time leaders in A have little power to squash A', so instead B follows suit by allowing a highly automation-oriented subculture B' to develop. These advantages are small enough not to trigger regulatory oversight, but when integrated over time they are not "tiny". This results in the gradual empowerment of humans who are misaligned with preserving human existence, until those humans also lose control of their own existence, perhaps willfully, or perhaps carelessly, or through a mix of both.

Here, the members of subculture A' are misaligned with preserving the existence of humanity, but their tech is aligned with them.
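
As a toy illustration of how an edge that looks small per period can stop looking "tiny" once integrated over time, consider the sketch below. Every number in it is hypothetical: the 1% per-period productivity edge and the period counts are assumptions chosen only to show the compounding, not estimates from the scenario.

```python
def relative_size(edge_per_period: float, periods: int) -> float:
    """Size of the faster-growing culture relative to the slower one.

    Assumption (illustrative only): the faster culture out-grows the slower
    one by a constant fractional edge every period, compounding each time.
    """
    return (1.0 + edge_per_period) ** periods


HYPOTHETICAL_EDGE = 0.01  # assumed 1% advantage per period

for periods in (10, 50, 100):
    ratio = relative_size(HYPOTHETICAL_EDGE, periods)
    print(f"after {periods:3d} periods: {ratio:.2f}x the slower culture")
```

How quickly this matters depends entirely on the assumed edge and on how long a "period" is, which is roughly where the two sides of this exchange disagree.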

Both are aiming to preserve human values, but within A, a subculture A' develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values.

I was asking you why you thought A' would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?

One obvious reason is single-single misalignment---A' is willing to deploy misaligned AI in order to get an advantage, while B isn't---but you say "their tech is aligned with them" so it sounds like you're setting this aside. But maybe you mean that A' has values that make alignment easy, while B has values that make alignment hard, and so B's disadvantage still comes from single-single misalignment even though A''s systems are aligned?
Another advantage is that A' can invest almost all of their resources, while B wants to spend some of their resources today to e.g. help presently-living humans flourish. But quantitatively that advantage doesn't seem like it can cause A' to dominate, since B can secure rapidly rising quality of life for all humans using only a small fraction of its initial endowment.
Wei Dai has suggested that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important. For example, if a research-producer and research-consumer have different values, then the producer may restrict access as part of an inefficient negotiation process and so they may be at a competitive disadvantage relative to a competing community where research is shared freely. This feels inconsistent with many of the things you are saying in your story, but I might be misunderstanding what you are saying and it could be that some argument like Wei Dai's is the best way to translate your concerns into my language.
My sense is that you have something else in mind. I included the last bullet point as a representative example to describe the kind of advantage I could imagine you thinking that A' had.

> Both [cultures A and B] are aiming to preserve human values, but within A, a subculture A' develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values.
I was asking you why you thought A' would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?

Ah! Yes, this is really getting to the crux of things. The short answer is that I'm worried about the following failure mode:

Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes. This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values. Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.

(Here, I'm using the word "culture" to encode a mix of information subsuming utility functions, beliefs, decision theory, cognitive capacities, and other features determining the general tendencies of an agent or collective.)

Of course, an easy antidote to this failure mode is to have A or B win instead of A', because A and B both have some human values other than power-maximizing. The problem is that this whole situation is premised on a conflict between A and B over which culture should win, and then the following observation applies:

Wei Dai has suggested that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important.

In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization). This observation is slightly different from observations that "simple values dominate engineering efforts" as seen in stories about singleton paperclip maximizers. A key feature of the Production Web dynamic is not just that it's easy to build production maximizers, but that it's easy to accidentally cooperate on building production-maximizing systems that destroy both you and your competitors.

This feels inconsistent with many of the things you are saying in your story, but

Thanks for noticing whatever you think are the inconsistencies; if you have time, I'd love for you to point them out.

I might be misunderstanding what you are saying and it could be that some argument like Wei Dai's is the best way to translate your concerns into my language.

This seems pretty likely to me. The bolded attribution to Dai above is a pretty important RAAP in my opinion, and it's definitely a theme in the Production Web story as I intend it. Specifically, the subprocesses of each culture that are in charge of production-maximization end up cooperating really well with each other in a way that ends up collectively overwhelming the original (human) cultures. Throughout this, each cultural subprocess is doing what its "host culture" wants it to do from a unilateral perspective (work faster / keep up with the competitor cultures), but the overall effect is destruction of the host cultures (a la Prisoner's Dilemma) by the cultural subprocesses.

If I had to use alignment language, I'd say "the production web overall is misaligned with human culture, while each part of the web is sufficiently well-aligned with the human entit(ies) who interact with it that it is allowed to continue operating". Too low of a bar for "allowed to continue operating" is key to the failure mode, of course, and you and I might have different predictions about what bar humanity will actually end up using at roll-out time. I would agree, though, that conditional on a given roll-out date, improving E[alignment_tech_quality] on that date is good and complementary to improving E[cooperation_tech_quality] on that date.

Did this get us any closer to agreement around the Production Web story? Or if not, would it help to focus on the aforementioned inconsistencies with homogenous-coordination-advantage?

Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes. This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values. Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.

I'm wondering why the easiest way is to copy A'---why was A' better at acquiring influence in the first place, so that copying them or investing in them is a dominant strategy? I think I agree that once you're at that point, A' has an advantage.

In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization).

This doesn't feel like other words to me, it feels like a totally different claim.

Thanks for noticing whatever you think are the inconsistencies; if you have time, I'd love for you to point them out.

In the production web story it sounds like the web is made out of different firms competing for profit and influence with each other, rather than a set of firms that are willing to leave profit on the table to benefit one another since they all share the value of maximizing production. For example, you talk about how selection drives this dynamic, but the firms that succeed are those that maximize their own profits and influence (not those that are willing to leave profit on the table to benefit other firms).

So none of the concrete examples of Wei Dai's economies of scale actually seem to apply to give an advantage for the profit-maximizers in the production web. For example, natural monopolies in the production web wouldn't charge each other marginal costs, they would charge profit-maximizing profits. And they won't share infrastructure investments except by solving exactly the same bargaining problem as any other agents (since a firm that indiscriminately shared its infrastructure would get outcompeted). And so on.

Specifically, the subprocesses of each culture that are in charge of production-maximization end up cooperating really well with each other in a way that ends up collectively overwhelming the original (human) cultures.

This seems like a core claim (certainly if you are envisioning a scenario like the one Wei Dai describes), but I don't yet understand why this happens.

Suppose that the US and China both have productive widget-industries. You seem to be saying that their widget-industries can coordinate with each other to create lots of widgets, and they will do this more effectively than the US and China can coordinate with each other.

Could you give some concrete example of how the US widget industry and the Chinese widget industries coordinate with each other to make more widgets, and why this behavior is selected?

For example, you might think that the Chinese and US widget industries share their insights into how to make widgets (as the aligned actors do in Wei Dai's story), and that this will cause widget-making to do better than other non-widget sectors where such coordination is not possible. But I don't see why they would do that---the US firms that share their insights freely with Chinese firms do worse, and would be selected against in every relevant sense, relative to firms that attempt to effectively monetize their insights. But effectively monetizing their insights is exactly what the US widget industry should do in order to benefit the US. So I see no reason why the widget industry would be more prone to sharing its insights.

So I don't think that particular example works. I'm looking for an example of that form though, some concrete form of cooperation that the production-maximization subprocesses might engage in that allows them to overwhelm the original cultures, to give some indication for why you think this will happen in general.

> Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes. This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values. Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.
> In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization).
This doesn't feel like other words to me, it feels like a totally different claim.

Hmm, perhaps this is indicative of a key misunderstanding.

For example, natural monopolies in the production web wouldn't charge each other marginal costs, they would charge profit-maximizing profits.

Why not? The third paragraph of the story indicates that: "Companies closer to becoming fully automated achieve faster turnaround times, deal bandwidth, and creativity of negotiations." In other words, at that point it could certainly happen that two monopolies would agree to charge each other lower cost if it benefitted both of them. (Unless you'd count that as an instance of "charging profit-maximizing costs"?) The concern is that the subprocesses of each company/institution that get good at (or succeed at) bargaining with other institutions are subprocesses that (by virtue of being selected for speed and simplicity) are less aligned with human existence than the original overall company/institution, and that less-aligned subprocess grows to take over the institution, while always taking actions that are "good" for the host institution when viewed as a unilateral move in an uncoordinated game (hence passing as "aligned").

At this point, my plan is to try to consolidate what I think are the main confusions in the comments of this post into one or more new concepts to form the topic of a new post.

At this point, my plan is to try to consolidate what I think are the main confusions in the comments of this post into one or more new concepts to form the topic of a new post.

Sounds great! I was thinking myself about setting aside some time to write a summary of this comment section (as I see it).

But eventually the Board realizes this "slow and bureaucratic check-in process" is making their company sluggish and uncompetitive, so they instruct the auto-CEO more and more to act without alignment check-ins. The auto-CEO might warn them that this will decrease its overall level of per-decision alignment with them, but they say "Do it anyway; done is better than perfect" or something along those lines. All Boards wish other Boards would stop doing this, but neither they nor their CEOs manage to strike up a bargain with the rest of the world to stop it. [emphasis mine]

This is the part that is most confusing to me. Why isn't it the case that one auto-CEO (or more likely, a number of auto-CEOs, each one reasoning along similar lines, independently) comes to its board and lays out the kinds of problems that are likely to occur if the world keeps accelerating (of the sort described in this post) and proposes some coordination schemes to move towards a pareto-improved equilibrium? Then that company goes around and starts brokering with the other companies, many of whom are independently seeking to implement some sort of coordination scheme like this one.

Stated differently, why don't the pretty-aligned (single-single) AI systems develop the bargaining and coordination methods that you're proposing we invest in now?

It seems like if we have single-single solved, we're in a pretty good place for delegating single-multi, and multi-multi to the AIs.

Yes, you understand me here. I'm not (yet?) in the camp that we humans have "mostly" lost sight of our basic goals, but I do feel we are on a slippery slope in that regard. Certainly many people feel "used" by employers/institutions in ways that are disconnected from their values. People with more job options feel less this way, because they choose jobs that don't feel like that, but I think we are a minority in having that choice.

I think this is an indication of the system serving some people (e.g. capitalists, managers, high-skilled labor) better than others (e.g. the median line worker). That's a really important and common complaint with the existing economic order, but I don't really see how it indicates a Pareto improvement or is related to the central thesis of your post about firms failing to help their shareholders.

(In general wage labor is supposed to benefit you by giving you money, and then the question is whether the stuff you spend money on benefits you.)

Is the following scenario a good example of the sort of problem you have in mind? Say you have two advanced ML systems with values that are partially, but not entirely, aligned with humanity: their utility function is 0.9 * (human values) + 0.1 * (control of resources). These two ML systems have been trained with advanced RL, in such a fashion that, when interacting with other powerful systems, they learn to play Nash equilibria. The only Nash equilibrium of their interaction is one where they ruthlessly compete for resources, making the Earth uninhabitable in the process. So both systems are "pretty much aligned", but their joint interaction is radically unaligned. If this seems like a reasonable example, two thoughts:

A) I think other people in this discussion might be envisioning 'aligned AI' as looking more like an approval-directed agent, rather than a system trained with RL on a proxy for the human utility function. Crucially, in this paradigm the system's long-term planning and bargaining are emergent consequences of what it predicts an (amplified) human would evaluate highly, they're not baked into the RL algorithm itself. This means it would only try to play a Nash equilibrium if it thinks humans would value that highly, which, in this scenario, they would not. In approval-directed AI systems, or more generally systems where strategic behavior is an emergent consequence of some other algorithm, bargaining ability should rise in tandem with general capability, making it unlikely that very powerful systems would have 'obvious' bargaining failures.

B) It seems that systems that are bad at bargaining would also be worse at acquiring resources. For instance, maybe the Nash equilibrium of the above interaction of two RL agents would actually be more like 'try to coordinate a military strike against the other AI as soon as possible', leaving both systems crippled, or leading to a unipolar scenario (which would be OK given the systems' mostly-aligned utility functions). The scenarios in the post seem to envision systems with some ability to bargain with others, but only for certain parts of their utility function, maybe those that are simple to measure. I think it might be worth emphasizing that more, or describing what kind of RL algorithms would give rise to bargaining abilities that look like that.
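For readers who want the two-system scenario above in concrete form, here is a minimal sketch of a 2x2 game in which each system's utility is 0.9 * (human values) + 0.1 * (control of resources), yet the only pure-strategy Nash equilibrium is mutual resource-grabbing. The action names and every number in the outcome table are hypothetical assumptions chosen so that grabbing is a dominant strategy; the comment itself does not specify any payoffs.

```python
from itertools import product

ACTIONS = ("share", "grab")

# Hypothetical outcome table (all numbers are illustrative assumptions):
# (action of system 0, action of system 1)
#   -> (human_values, resources of system 0, resources of system 1)
OUTCOMES = {
    ("share", "share"): (10.0, 100.0, 100.0),
    ("grab", "share"): (5.0, 200.0, 0.0),
    ("share", "grab"): (5.0, 0.0, 200.0),
    ("grab", "grab"): (0.0, 60.0, 60.0),  # Earth uninhabitable: human values at 0
}


def utility(player: int, profile: tuple) -> float:
    """Utility from the comment: 0.9 * human values + 0.1 * own resources."""
    human_values, *resources = OUTCOMES[profile]
    return 0.9 * human_values + 0.1 * resources[player]


def is_nash(profile: tuple) -> bool:
    """True if neither player gains by unilaterally changing its action."""
    for player in (0, 1):
        for alternative in ACTIONS:
            deviation = list(profile)
            deviation[player] = alternative
            if utility(player, tuple(deviation)) > utility(player, profile):
                return False
    return True


def pure_nash_equilibria() -> list:
    """Enumerate all pure-strategy Nash equilibria of the 2x2 game."""
    return [p for p in product(ACTIONS, repeat=2) if is_nash(p)]


if __name__ == "__main__":
    print(pure_nash_equilibria())
```

With these particular numbers, both systems prefer mutual sharing to mutual grabbing, but each does better by grabbing whatever the other does, so the structure is a Prisoner's Dilemma. Different hypothetical payoffs could of course give a different equilibrium, which is part of what points A) and B) are probing.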

Overall, I think I agree with some of the most important high-level claims of the post:

The world would be better if people could more often reach mutually beneficial deals. We would be more likely to handle challenges that arise, including those that threaten extinction (and including challenges posed by AI, alignment and otherwise). It makes sense to talk about "coordination ability" as a critical causal factor in almost any story about x-risk.
The development and deployment of AI may provide opportunities for cooperation to become either easier or harder (e.g. through capability differentials, alignment failures, geopolitical disruption, or distinctive features of artificial minds). So it can be worthwhile to do work in AI targeted at making cooperation easier, even and especially for people focused on reducing extinction risks.

I also read the post as implying or suggesting some things I'd disagree with:

That there is some real sense in which "cooperation itself is the problem." I basically think all of the failure stories will involve some other problem that we would like to cooperate to solve, and we can discuss how well humanity cooperates to solve it (and compare "improve cooperation" to "work directly on the problem" as interventions). In particular, I think the stories in this post would basically be resolved if single-single alignment worked well, and that taking the stories in this post seriously suggests that progress on single-single alignment makes the world better (since evidently people face a tradeoff between single-single alignment and other goals, so that progress on single-single alignment changes what point on that tradeoff curve they will end up at, and since compromising on single-single alignment appears necessary to any of the bad outcomes in this story).
Relatedly, that cooperation plays a qualitatively different role than other kinds of cognitive enhancement or institutional improvement. I think that both cooperative improvements and cognitive enhancement operate by improving people's ability to confront problems, and both of them have the downside that they also accelerate the arrival of many of our future problems (most of which are driven by human activity). My current sense is that cooperation has a better tradeoff than some forms of enhancement (e.g. giving humans bigger brains) and worse than others (e.g. improving the accuracy of people's and institutions' beliefs about the world).
That the nature of the coordination problem for AI systems is qualitatively different from the problem for humans, or somehow is tied up with existential risk from AI in a distinctive way. I think that the coordination problem amongst reasonably-aligned AI systems is very similar to coordination problems amongst humans, and that interventions that improve coordination amongst existing humans and institutions (and research that engages in detail with the nature of existing coordination challenges) are generally more valuable than e.g. work in multi-agent RL or computational social choice.
That this story is consistent with your prior arguments for why single-single alignment has low (or even negative) value. For example, in this comment you wrote "reliability is a strongly dominant factor in decisions in deploying real-world technology, such that to me it feels roughly-correctly to treat it as the only factor." But in this story people choose to adopt technologies that are less robustly aligned because they lead to more capabilities. This tradeoff has real costs even for the person deploying the AI (who is ultimately no longer able to actually receive any profits at all from the firms in which they are nominally a shareholder). So to me your story seems inconsistent with that position and with your prior argument. (Though I don't actually disagree with the framing in this story, and I may simply not understand your prior position.)

Thanks for this synopsis of your impressions, and +1 to the two points you think we agree on.

I also read the post as implying or suggesting some things I'd disagree with:

As for these, some of them are real positions I hold, while some are not:

That there is some real sense in which "cooperation itself is the problem."

I don't hold that view. The closest view I hold is more like: "Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment."

Relatedly, that cooperation plays a qualitatively different role than other kinds of cognitive enhancement or institutional improvement.

I don't hold the view you attribute to me here, and I agree wholesale with the following position, including your comparisons of cooperation with brain enhancement and improving belief accuracy:

I think that both cooperative improvements and cognitive enhancement operate by improving people's ability to confront problems, and both of them have the downside that they also accelerate the arrival of many of our future problems (most of which are driven by human activity). My current sense is that cooperation has a better tradeoff than some forms of enhancement (e.g. giving humans bigger brains) and worse than others (e.g. improving the accuracy of people's and institutions' beliefs about the world).

... with one caveat: some beliefs are self-fulfilling, such as beliefs about cooperation/defection. There are ways of improving belief accuracy that favor defection, and ways that favor cooperation. Plausibly to me, the ways of improving belief accuracy that favor defection are worse than no accuracy improvement at all. I'm not particularly firm in this view, though; it's more of a hedge.

That the nature of the coordination problem for AI systems is qualitatively different from the problem for humans, or somehow is tied up with existential risk from AI in a distinctive way. I think that the coordination problem amongst reasonably-aligned AI systems is very similar to coordination problems amongst humans, and that interventions that improve coordination amongst existing humans and institutions (and research that engages in detail with the nature of existing coordination challenges) are generally more valuable than e.g. work in multi-agent RL or computational social choice.

I do hold this view! Particularly the bolded part. I also agree with the bolded parts of your counterpoint, but I think you might be underestimating the value of technical work (e.g., CSC, MARL) directed at improving coordination amongst existing humans and human institutions.

I think blockchain tech is a good example of an already-mildly-transformative technology for implementing radically mutually transparent and cooperative strategies through smart contracts. Make no mistake: I'm not claiming blockchain tech is going to "save the world"; rather, it's changing the way people cooperate, and is doing so as a result of a technical insight. I think more technical insights are in order to improve cooperation and/or the global structure of society, and it's worth spending research efforts to find them.

Reminder: this is not a bid for you personally to quit working on alignment!

That this story is consistent with your prior arguments for why single-single alignment has low (or even negative) value. For example, in this comment you wrote "reliability is a strongly dominant factor in decisions in deploying real-world technology, such that to me it feels roughly-correctly to treat it as the only factor." But in this story people choose to adopt technologies that are less robustly aligned because they lead to more capabilities. This tradeoff has real costs even for the person deploying the AI (who is ultimately no longer able to actually receive any profits at all from the firms in which they are nominally a shareholder). So to me your story seems inconsistent with that position and with your prior argument. (Though I don't actually disagree with the framing in this story, and I may simply not understand your prior position.)

My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens. In practice, I think the threshold by default will not be "Reliable enough to partake in a globally cooperative technosphere that preserves human existence", but rather, "Reliable enough to optimize unilaterally for the benefits of the stakeholders of each system, i.e., to maintain or increase each stakeholder's competitive advantage." With that threshold, there easily arises a RAAP racing to the bottom on how much human control/safety/existence is left in the global economy. I think both purely-human interventions (e.g., talking with governments) and sociotechnical interventions (e.g., inventing cooperation-promoting tech) can improve that situation. This is not to say "cooperation is all you need", any more than I would say "alignment is all you need".

Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment

Sounds like we are on broadly the same page. I would have said "Aligning ML systems is more likely if we understand more about how to align ML systems, or are better at coordinating to differentially deploy aligned systems, or are wiser or smarter or..." and then moved on to talking about how alignment research quantitatively compares to improvements in various kinds of coordination or wisdom or whatever. (My bottom line from doing this exercise is that I feel more general capabilities typically look less cost-effective on alignment in particular, but benefit a ton from the diversity of problems they help address.)

My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.

I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through.

Reminder: this is not a bid for you personally to quit working on alignment!

I'm reading this (and your prior post) as bids for junior researchers to shift what they focus on. My hope is that seeing the back-and-forth in the comments will, in expectation, help them decide better.

> My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.
I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through.

Yeah I agree we probably can't reach convergence on how alignment affects deployment time, at least not in this medium (especially since a lot of info about company policies / plans / standards are covered under NDAs), so I also think it's good to leave this question about deployment-time as a hanging disagreement node.

I'm reading this (and your prior post) as bids for junior researchers to shift what they focus on. My hope is that seeing the back-and-forth in the comments will, in expectation, help them decide better.

Yes to both points; I'd thought of writing a debate dialogue on this topic trying to cover both sides, but commenting with you about it is turning out better I think, so thanks for that!

Curated. I appreciated this post for a combination of:

laying out several concrete stories about how AI could lead to human extinction
laying out a frame for how to think about those stories (while acknowledging other frames one could apply to the story)
linking to a variety of research, with more thoughts on what sort of further research might be helpful.

I also wanted to highlight this section:

Finally, I should also mention that I agree with Tom Dietterich’s view (dietterich2019robust) that we should make AI safer to society by learning from high-reliability organizations (HROs), such as those studied by social scientists Karlene Roberts, Gene Rochlin, and Todd LaPorte (roberts1989research, roberts1989new, roberts1994decision, roberts2001systems, rochlin1987self, laporte1991working, laporte1996high). HROs have a lot of beneficial agent-agnostic human-implemented processes and control loops that keep them operating. Again, Dietterich himself is not as yet a proponent of existential safety concerns; however, to me this does not detract from the correctness of his perspective on learning from the HRO framework to make AI safer.

Which is a thing I think I once heard Critch talk about, but which I don't think had been discussed much on LessWrong, and which I'd be interested in seeing more thoughts and distillation of.

Thanks for the great post. I found this collection of stories and framings very insightful.

1. Strong +1 to "Problems before solutions." I'm much more focused when reading this story (or any threat model) on "do I find this story plausible and compelling?" (which is already a tremendously high bar) before even starting to get into "how would this update my research priorities?"

2. I wanted to add a mention to Katja Grace's "Misalignment and Misuse" as another example discussing how single-single alignment problems and bargaining failures can blur together and exacerbate each other. The whole post is really short, but I'll quote anyways:

I think a likely scenario leading to bad outcomes is that AI can be made which gives a set of people things they want, at the expense of future or distant resources that the relevant people do not care about or do not own...
When the business strategizing AI systems finally plough all of the resources in the universe into a host of thriving 21st Century businesses, was this misuse or misalignment or accident? The strange new values that were satisfied were those of the AI systems, but the entire outcome only happened because people like Bob chose it knowingly (let’s say). Bob liked it more than the long glorious human future where his business was less good. That sounds like misuse. Yet also in a system of many people, letting this decision fall to Bob may well have been an accident on the part of others, such as the technology’s makers or legislators.

In the post's story, both "misalignment" and "misuse" seem like two different, both valid, frames on the problem.

3. I liked the way this point is phrased on agent-agnostic and agent-centric (single-single alignment-focused) approaches as complementary.

The agent-focused and agent-agnostic views are not contradictory... Instead, the agent-focused and agent-agnostic views offer complementary abstractions for intervening on the system... Both types of interventions are valuable, complementary, and arguably necessary.

At one extreme end, in the world where we could agree on what constitutes an acceptable level of xrisk, and could agree to not build AI systems which exceed this level, and give ourselves enough time to figure out the alignment issues in advance, we'd be fine! (We would still need to do the work of actually figuring out a bunch of difficult technical and philosophical questions, but importantly, we would have the time and space to do this work.) To the extent we can't do this, what are the RAAPs, such as intense competition, which prevent us from doing so?

And at the other extreme, if we develop really satisfying solutions to alignment, we also shouldn't end up in worlds where we have "little human insight" or factories "so pervasive, well-defended, and intertwined with our basic needs that we are unable to stop them from operating."

I think Paul often makes this point in the context of discussing an alignment tax. We can both decrease the size of the tax, and make the tax more appealing/more easily enforceable.

4. I expect to reconsider many concepts through the RAAPs lens in the next few months. Towards this end, it'd be great to see a more detailed description of what the RAAPs in these stories are. For example, a central example here is "the competitive pressure to produce." We could also maybe think about "a systemic push towards more easily quantifiable metrics (e.g. profit vs. understanding or global well-being)" which WFLL1 talks about or "strong societal incentives for building powerful systems without correspondingly strong societal incentives for reflection on how to use them". I'm currently thinking about all these RAAPs as a web (or maybe a DAG), where we can pull on any of these different levers to address the problem, as opposed to there being a single true RAAP; does that seem right to you?

Relatedly, I'd be very interested in a post investigating just a single RAAP (what is the cause of the RAAP? what empirical evidence shows the RAAP exists? how does the RAAP influence various threat models?). If you have a short version too, I think that'd help a lot in terms of clarifying how to think about RAAPs.

5. My one quibble is that there may be some criticism of the AGI safety community which seems undeserved. For example, when you write "That is, outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes," it seems to imply that inside this community, researchers don't think about RAAPs (though perhaps this is not what you meant!). It seems that many inside these circles think about agent-agnostic processes too! (Though not framed in these terms, and I expect this additional framing will be helpful.) Your section on "Successes in our agent-agnostic thinking" gives many such examples.

This is a quibble in the sense that, yes, I absolutely agree there is lots of room for much needed work on understanding and addressing RAAPs, that yes, we shouldn't take the extreme physical and economic competitiveness of the world for granted, and yes, we should work to change these agent-agnostic forces for the better. I'd also agree this should ideally be a larger fraction of our "portfolio" on the margin (acknowledging pragmatic difficulties to getting here). But I also think the AI safety community has had important contributions on this front.

Thanks for the pointer to grace2020whose! I've added it to the original post now under "successes in our agent-agnostic thinking".

But I also think the AI safety community has had important contributions on this front.

For sure, that is the point of the "successes" section. Instead of "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes" I should probably have said "outside the EA / rationality / x-risk meme-bubbles, lots of AI researchers think about agent-agnostic processes, and to my eye there should be more communication across the boundary of that bubble."

I consider this post one of the most important ever written on issues of timelines and AI doom scenarios. Not because it's perfect (some of its assumptions are unconvincing), but because it highlights a key aspect of AI Risk and the alignment problem which is so easy to miss coming from a rationalist mindset: it doesn't require an agent to take over the whole world. It is not about agency.

What RAAPs show instead is that even in a purely structural setting, where agency doesn't matter, these problems still crop up!

This insight was already present in Drexler's work, but however insightful Eric is in person, CAIS is completely unreadable and so no one cared. But this post is well written. Not perfectly once again, but it gives short, somewhat minimal proofs of concept for this structural perspective on alignment. And it also managed to tie alignment with key ideas in sociology, opening ways for interdisciplinarity.

I have made every person I have ever mentored on alignment study this post. And I plan to continue doing so. Despite the fact that I'm unconvinced by most timeline and AI risk scenario posts. That's how good and important it is.

Just registering my own disagreement here -- I don't think it's a key aspect, because I don't think it's necessary; the bulk of the problem IS about agency & this post encourages us to focus on the wrong problems.

I do agree that this post is well written and that it successfully gives proofs of concept for the structural perspective, for there being important problems that don't have to do with agency, etc. I just think that the biggest problems do have to do with agency & this post is a distraction from them.

(My opinion is similar to what Kosoy and Christiano said in their comments)

I'd be interested in which bits strike you as notably "imperfect."

I was mostly thinking of the efficiency assumption underlying almost all the scenarios. Critch assumes that a significant chunk of the economy always can and does make the most efficient change (everyone replacing the job, automated regulations replacing banks when they can't move fast enough). This neglects many potential factors, like big economic actors not having to be efficient for a long time, backlash from customers, and in general all factors making economic actors and markets less than efficient.

I expect that most of these factors could be addressed with more work on the scenarios.

Planned summary for the Alignment Newsletter:

A robust agent-agnostic process (RAAP) is a process that robustly leads to an outcome, without being very sensitive to the details of exactly which agents participate in the process, or how they work. This is illustrated through a “Production Web” failure story, which roughly goes as follows:
A breakthrough in AI technology leads to a wave of automation of $JOBTYPE (e.g. management) jobs. Any companies that don’t adopt this automation are outcompeted, and so soon most of these jobs are completely automated. This leads to significant gains at these companies and higher growth rates. These semi-automated companies trade amongst each other frequently, and a new generation of "precision manufacturing" companies arises that can build almost anything using robots given the right raw materials. A few companies develop new software that can automate $OTHERJOB (e.g. engineering) jobs. Within a few years, nearly all human workers have been replaced.
These companies are now roughly maximizing production within their various industry sectors. Lots of goods are produced and sold to humans at incredibly cheap prices. However, we can’t understand how exactly this is happening. Even Board members of the fully mechanized companies can’t tell whether the companies are serving or merely appeasing humanity; government regulators have no chance.
We do realize that the companies are maximizing objectives that are incompatible with preserving our long-term well-being and existence, but we can’t do anything about it because the companies are both well-defended and essential for our basic needs. Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.
Notice that in this story it didn’t really matter what job type got automated first (nor did it matter which specific companies took advantage of the automation). This is the defining feature of a RAAP -- the same general story arises even if you change around the agents that are participating in the process. In particular, in this case competitive pressure to increase production acts as a “control loop” that ensures the same outcome happens, regardless of the exact details about which agents are involved.
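To make the "control loop" point concrete, here is a minimal toy sketch (my own illustration, not something from the post). It assumes a deliberately crude model: a fixed set of hypothetical firms, one of which automates first, and a rule that any non-automated firm that meets an automated rival either adopts the automation or is replaced by a firm that does. The function name `simulate` and all parameters are made up for this example.

```python
import random


def simulate(pioneer, num_firms=20, seed=0, max_steps=100_000):
    """Toy competitive-adoption process (illustrative assumption, not the post's model).

    Returns (automated_share, steps_taken).
    """
    rng = random.Random(seed)
    automated = [False] * num_firms
    automated[pioneer] = True  # which firm automates first is a free parameter
    steps = 0
    while not all(automated) and steps < max_steps:
        steps += 1
        firm = rng.randrange(num_firms)   # a firm facing competition this step
        rival = rng.randrange(num_firms)  # the competitor it happens to meet
        # Competitive pressure: an automated rival forces adoption (or replacement
        # by an adopter, which looks the same at this level of abstraction).
        if automated[rival] and not automated[firm]:
            automated[firm] = True
    return sum(automated) / num_firms, steps


# Vary both the pioneer and the random order of encounters.
results = {p: simulate(pioneer=p, seed=p) for p in range(5)}
for p, (share, steps) in results.items():
    print(f"pioneer={p} automated_share={share:.2f} steps={steps}")
```

In this sketch the path (who adopts when, and how many steps it takes) varies from run to run, but the end state does not depend on which firm goes first. That is only the agent-agnostic part of the story under these toy assumptions; it says nothing about how strong the competitive pressure is in reality.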

Planned opinion (shared with Another (outer) alignment failure story):

Both the previous story and this one seem quite similar to each other, and seem pretty reasonable to me as a description of one plausible failure mode we are aiming to avert. The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.
A natural next question is then which of the two failures would be best to intervene on, that is, is it more useful to work on intent alignment, or working on coordination? I’ll note that my best guess is that for any given person, this effect is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me. Nonetheless, my guess is that on the current margin, for technical research in particular, holding all else equal, it is more impactful to focus on intent alignment. You can see a much more vigorous discussion in e.g. [this comment thread](https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=3czsvErCYfvJ6bBwf).

The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.

Yes, I agree with this.

A natural next question is then which of the two failures would be best to intervene on, that is, is it more useful to work on intent alignment, or working on coordination? I’ll note that my best guess is that for any given person, this effect is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me.

Yes! +10 to this! For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment (which I'm not), which triggers a "Yes, this is the most valuable thing" reply. I'm trying to say "Hey, if you care about AI x-risk, alignment isn't the only game in town", and staking some personal reputation points to push against the status quo where almost-everyone x-risk-oriented will work on alignment and almost-nobody x-risk-oriented will work on cooperation/coordination or multi/multi delegation.

Perhaps I should start saying "Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?", and maybe that will trigger less pushback of the form "No, alignment is the most important thing"...

For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment

In fairness, writing “marginal deep-thinking researchers [should not] allocate themselves to making alignment […] cheaper/easier/better” is pretty similar to saying “one shouldn’t work on alignment.”

(I didn’t read you as saying that Paul or Rohin shouldn’t work on alignment, and indeed I’d care much less about that than about a researcher at CHAI arguing that CHAI students shouldn’t work on alignment.)

On top of that, in your prior post you make stronger claims:

“Contributions to OODR research are not particularly helpful to existential safety in my opinion.”
“Contributions to preference learning are not particularly helpful to existential safety in my opinion”
“In any case, I see AI alignment in turn as having two main potential applications to existential safety:” (excluding the main channel Paul cares about and argues for, namely that making alignment easier improves the probability that the bulk of deployed ML systems are aligned and reduces the competitive advantage for misaligned agents)

In the current post you (mostly) didn’t make claims about the relative value of different areas, and so I was (mostly) objecting to arguments that I consider misleading or incorrect. But you appeared to be sticking with the claims from your prior post and so I still ascribed those views to you in a way that may have colored my responses.

maybe that will trigger less pushback of the form "No, alignment is the most important thing"...

I’m not really claiming that AI alignment is the most important thing to work on (though I do think it’s among the best ways to address problems posed by misaligned AI systems in particular). I’m generally supportive of and excited about a wide variety of approaches to improving society’s ability to cope with future challenges (though multi-agent RL or computational social choice would not be near the top of my personal list).

Perhaps I should start saying "Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?", and maybe that will trigger less pushback of the form "No, alignment is the most important thing"...

I think that probably would be true.

For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment (which I'm not), which triggers a "Yes, this is the most valuable thing" reply.

Fwiw my reaction is not "Critch thinks Rohin should do something else", it's more like "Critch is saying something I believe to be false on an important topic that lots of other people will read". I generally want us as a community to converge to true beliefs on important things (part of my motivation for writing a newsletter) and so then I'd say "but actually alignment still seems like the most valuable thing on the margin because of X, Y and Z".

(I've had enough conversations with you at this point to know the axes of disagreement, and I think you've convinced me that "which one is better on the margin" is not actually that important a question to get an answer to. So now I don't feel as much of an urge to respond that way. But that's how I started out.)

Further prior art: Accelerando

I was reminded of the central metaphor of Acemoglu and Robinson's "The Narrow Corridor" as a RAAP candidate:

civil society wants to be able to control the government & undermines government if not
the government wants to become more powerful
successful societies inhabit a narrow corridor in which strengthening governments are strongly coupled with strengthening civil societies

That's an interesting connection to make. I am not familiar with the argument in detail, but at first glance I agree that the RAAP concept is meant to capture some principled institutional relationship between different cross-sections of society. I might disagree with "The Narrow Corridor" argument to the extent that RAAPs are not meant to prioritize or safeguard individual liberty so much as articulate a principled relationship between markets and states; according to RAAPs, the most likely path to multi-polar failure is particular forms of market failure that might require new grounds for antitrust enforcement of AI firms. The need for antitrust and antimonopoly policy is thus a straightforward rejection of the idea that enlightenment values will generate some kind of natural steady state where liberty is automatically guaranteed. I develop some of these ideas in my whitepaper on the political economy of reinforcement learning; I'm curious to hear how you see those arguments resonating with Acemoglu and Robinson.

Good point relating it to markets. I think I don't understand Acemoglu and Robinson's perspective well enough here, as the relationship between state, society and markets is the biggest questionmark I left the book with. I think A&R don't necessarily only mean individual liberty when talking about power of society, but the general influence of everything that falls in the "civil society" cluster.

I would like to register a preference for resilient over robust. This is because in everyday language robust implies that the thing you are talking about is very hard to change, whereas resilient implies it recovers even when changes are applied. So the outcome is robust, because the processes are resilient.

I also think it would be good to agree with the logistics and systems engineering literature suggested in the google link, but how regular people talk is my true motivation because I feel like getting this to change will involve talking to a lot of non-experts in any technical literature.

I like this suggestion.

This is great, thank you!

Minor formatting note: The italics font on both the AI Alignment Forum and LessWrong isn't super well suited to large blocks of text, so I took the liberty to unitalicize a bunch of the large blockquotes (which should be sufficiently distinguishable as blockquotes without the italics). Though I am totally happy to reverse it if you prefer the previous formatting.

Thanks a lot for this post, I found it extremely helpful and expect I will refer to it a lot in thinking through different threat models.

I'd be curious to hear how you think the Production Web stories differ from part 1 of Paul's "What failure looks like".

To me, the underlying threat model seems to be basically the same: we deploy AI systems with objectives that look good in the short-run, but when those systems become equally or more capable than humans, their objectives don't generalise "well" (i.e. in ways desirable by human standards), because they're optimising for proxies (namely, a cluster of objectives that could loosely be described as "maximise production" within their industry sector) that eventually come apart from what we actually want ("maximising production" eventually means using up resources critical to human survival but non-critical to machines).

From reading some of the comment threads between you and Paul, it seems like you disagree about where, on the margin, resources should be spent (improving the cooperative capabilities of AI systems and humans vs improving single-single intent alignment) - but you agree on this particular underlying threat model?

It also seems like you emphasise different aspects of these threat models: you emphasise the role of competitive pressures more (but they're also implicit in Paul's story), and Paul emphasises failures of intent alignment more (but they're also present in your story) - though this is consistent with having the same underlying threat model?

(Of course, both you and Paul also have other threat models, e.g. you have Flash War, Paul has part 2 of "What failure looks like", and also Another (outer) alignment failure story, which seems to be basically a more nuanced version of part 1 of "What failure looks like". Here, I'm curious specifically about the two threat models I've picked out.)

(I could have lots of this totally wrong, and would appreciate being corrected if so)

Great post! I'm glad someone has outlined in clear terms what these failures look like, rather than the nebulous 'multiagent misalignment', as it lets us start on a path to clarifying what (if any) new mitigations or technical research are needed.

The agent-agnostic perspective is a very good innovation for thinking about these problems - the line between agentive and non-agentive behaviour is often not clear, and it's not like there is a principled metaphysical distinction between the two (e.g. Dennett and the Intentional Stance). Currently, big corporations can be weakly modelled this way and individual humans are fully agentive, but Transformative AI will bring a whole range of more and less agentive things that will fill up the rest of this spectrum.

There is a sense in which, if the outcome is something catastrophic, there must have been misalignment, and if there was misalignment then in some sense at least some individual agents were misaligned. Specifically, the systems in your Production Web weren't intent-aligned because they weren't doing what we wanted them to do, and were at least partly deceiving us. Assuming this is the case, 'multipolar failure' requires some subset of intent misalignment. But it's a special subset because it involves different kinds of failures to the ones we normally talk about.

It seems like you're identifying some dimensions of intent alignment as those most likely to be neglected because they're the hardest to catch, or because there will be economic incentives to ensure AI isn't aligned in that way, rather than saying that there is some sense in which the transformative AI in the production web scenario is 'fully aligned' but still produces an existential catastrophe.

I think that the difference between your Production Web and Paul Christiano's subtle creeping Outer Alignment failure scenario is just semantic - you say that the AIs involved are aligned in some relevant sense while Christiano says they are misaligned.

The further question then becomes, how clear is the distinction between multiagent alignment and 'all of alignment except multiagent alignment'. This is the part where your claim of 'Problems before solutions' actually does become an issue - given that the systems going wrong in Production Web aren't Intent-aligned (I think you'd agree with this), at a high level the overall problem is the same in single and multiagent scenarios.

So for it to be clear that there is a separate multiagent problem to be solved, we have to have some reason to expect that the solutions currently intended to solve single agent intent alignment aren't adequate, and that extra research aimed at examining the behaviour of AI e.g. in game theoretic situations, or computational social choice research, is required to avert these particular examples of misalignment.

A related point - as with single agent misalignment, the Fast scenarios seem more certain to occur, given their preconditions, than the slow scenarios.

A certain amount of stupidity and lack of coordination persisting for a while is required in all the slow scenarios, like the systems involved in Production Web being allowed to proliferate and be used more and more even if an opportunity to coordinate and shut the systems down exists and there are reasons to do so. There isn't an exact historical analogy for that type of stupidity so far, though a few things come close (e.g. the covid response, the leadup to WW2, the Cuban Missile Crisis).

As with single agent fast takeoff scenarios, in the fast stories there is a key 'treacherous turn' moment where the systems suddenly go wrong, which requires much less lack of coordination to be plausible than the slow Production Web scenarios.

Therefore, multipolar failure is less dangerous if takeoff is slower, but the difference in risk between slow vs fast takeoff for multipolar failure is unfortunately a lot smaller than the slow vs fast risk difference for single agent failure (where the danger is minimal if takeoff is slow enough). So multiagent failures seem like they would be the dominant risk factor if takeoff is sufficiently slow.

Dumb question here: What does agnostic mean in this context?

Agent-agnostic is meant to refer to multi-agent settings in which major decisions responsible for path-dependent behaviors are better explained through reference to roles (e.g. producer-consumer relationships in a market) than to particular agents. In other words, we are looking at processes in which it makes more sense to look to the structure of the market as a whole (are there monopolies or not, where does regulation fit in, is there likely to be regulatory capture, do unions exist) than to look at what particular firms may or may not do as a sufficient explanation for multipolar failure.

Thankfully, there have already been some successes in agent-agnostic thinking about AI x-risk

Also Sotala 2018 mentions the possibility of control over society gradually shifting over to a mutually trading collective of AIs (p. 323-324) as one "takeoff" route, as well as discussing various economic and competitive pressures to shift control over to AI systems and the possibility of a “race to the bottom of human control” where state or business actors [compete] to reduce human control and [increase] the autonomy of their AI systems to obtain an edge over their competitors (p. 326-328).

Sotala & Yampolskiy 2015 (p. 18) previously argued that:

In general, any broad domain involving high stakes, adversarial decision making and a need to act rapidly is likely to become increasingly dominated by autonomous systems. The extent to which the systems will need general intelligence will depend on the domain, but domains such as corporate management, fraud detection and warfare could plausibly make use of all the intelligence they can get. If oneʼs opponents in the domain are also using increasingly autonomous AI/AGI, there will be an arms race where one might have little choice but to give increasing amounts of control to AI/AGI systems.

Sort-of on the topic of terminology, how should "RAAP" be pronounced when spoken aloud? (If the term catches on, some pronunciation will be adopted.)

"Rap" sounds wrong because it fails to acknowledge the second A. Trading the short A for a long A yields "rape", which probably isn't a connotation you want. You could maybe push "rawp" (with "aw" as in "hawk").

If you don't like any of those, you might want to find another acronym with better phonetics.

My default would be "raahp", which doesn't have any of the problems you mentioned.

I'm uncertain what phonemes "raahp" denotes.

Much of this, especially the story of the production web and especially especially the story of the factorial DAOs, reminds me a lot of PKD's autofac. I'm sure there are other fictional examples worth highlighting, but I point out autofac since it's the earliest instance of this idea I'm aware of (published 1955).

I hadn't read it (nor almost any science fiction books/stories) but yes, you're right! I've now added a callback to Autofac after the "factorial DAO" story. Thanks.

I don't quite understand why "fast" and "slow" are called such. Hanson's "Economic growth given machine intelligence" assumes just efficient conversion of capital into human-level labor and gets an economic doubling time of 18 months, which is pretty much your "flash economy" story?

Very interesting read. Terrifying in fact.

The process involving an advanced AI described here is very robust and leads predictably to the destruction of humanity and much of the biosphere as well. But is strong AI really the root cause here, or solely the condition for a good Hollywood movie?

These scenarios assume, I think wrongly, that AI attaining a critical level is a requirement for this catastrophe to occur. We animals are largely insensitive to slow change and evolved to react to immediate threats. I think strong AI is in this scenario only so that we perceive the change. Whether or not AI attains such levels, the scenarios are still valid and generalisable to human agents, human organisations and processes, and political systems.

The same processes, enacted by not-so-slow human beings with not-so-weak mechanical assistants (chainsaws and bulldozers, supertankers and tall chimneys), may lead to similar results, although on a different time scale. Even slow exponential growth will hit the ceiling, given time.

From what I understand of the concept of alignment, and put provocatively: it aims to ensure that AI experts aren't to blame for the end of the world.

I recognise that alignment deals with a shorter term danger where AI is involved. Others outside the AI community can take this opportunity to realise that even if the AI folks fix it for AI, all of us need to fix it for the world.

The alignment concept is transferable to human governance. What would it take to identify the relevant incentives aligned with general human and biosphere well-being and socially engineer our societies accordingly? Reforming the value system away from destructive capital growth towards a system that positively reinforces well-being needs some more work, and innovative (social) ideas have not always been welcomed with a positive attitude!

LessWrong is already heading in a new direction, which is hopeful.

Clement Marshall
