Introduction

Among those thinking that AI is an existential risk, there seems to be significant disagreement on what the main threat model is. Threat model uncertainty makes it harder to reduce this risk: a faulty threat model used by an important actor will likely lead to suboptimal decision making. This is why I think there should be more discussion on what the probability is for each existential risk threat model, rather than merely how much one’s accumulated p(doom) is.

There are multiple threat model overviews, such as those by Kaj Sotala, Samuel Martin, and Richard Ngo. Three AI existential risk threat models seem to emerge as particularly popular:

  1. AI takeover (championed by Yudkowsky/Bostrom, common among rationalists).
  2. What Failure Looks Like (1, 2) (authored by Paul Christiano, common in EA).
  3. Bad actor risk (influential in policymaking).

This post will focus on What Failure Looks Like. I will argue that this threat model is very unlikely (<0.1%) to lead to an existential event. If true, this would mean more emphasis by those worrying about existential risk should be placed on the other threat models, such as various AI takeover scenarios and bad actor risk.

Also, I argue that decreasing the risks (both existential and nonexistential) caused by the What Failure Looks Like threat model should not be done by AI alignment or by an AI pause, but rather by traditional AI regulation after development of the technology and at the point of application. The EU AI Act with its tiered approach can serve as a model. Note that this does not imply that I think alignment and a pause are not important anymore: I think they could be crucial to counter a Yudkowsky/Bostrom-style AI takeover, which I see as the most likely existential risk.

Why I think the existential risk due to What Failure Looks Like is very low

In What Failure Looks Like (2019), Paul Christiano introduces his two threat models towards ‘going out’. It is not clear to me whether this means an existential event, defined by Toby Ord and others as human extinction, a permanent dystopia, or an unrecoverable collapse, where the last two options should be stable over billions of years. However, even if Christiano himself did not mean such an existential event, others seem to interpret his threat model in this way, for example in this paper by Richard Ngo: “However, many other alignment researchers are primarily concerned about more gradual erosion of human control driven by the former threat model, and involving millions or billions of copies of AGIs deployed across society [Christiano, 2019a,b, Karnofsky, 2022]. Regardless of how it happens, though, misaligned AGIs gaining control over these key levers of power would be an existential threat to humanity.” Since What Failure Looks Like is thus seen as a significant existential threat model by leading researchers such as Ngo, it would be highly relevant if, as I argue, this were not the case.

The two threat models of What Failure Looks Like are:

  1. Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. ("Going out with a whimper.")
  2. Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. ("Going out with a bang.")

I think both threat models combined have a less than 0.1% chance of leading to an existential event.

Both threat models involve many AIs. In both threat models, there does not seem to be a deliberate AI takeover (e.g. caused by a resource conflict), either unipolar or multipolar. Rather, the danger is, according to this model, that things are ‘breaking’, rather than ‘taking’. The existential event would be accidental, not on purpose.

I think things breaking or going off the rails without a clear goal (random, uncoordinated events) are very unlikely to lead to human extinction. For example, there are still isolated human tribes in places like the Amazon or remote islands which barely have contact with the rest of the world, and do not depend on it for anything required for survival. The extinction of such people caused only by technology breaking seems impossible. Furthermore, billions of people currently do not interact heavily with technology such as AI, and although this may change in the future, there are many fallback systems that could make sure essential infrastructure such as agriculture and transportation keeps working (I think increasing such low-tech, low-coordination fallback systems, which could make our society more robust in a hypothetical AI-soaked future, is useful risk-reducing work). Extinction is an extremely high bar, and AI systems failing or heading off in random, uncoordinated directions, even in a future society soaked in AI, does not meet that bar.

Societal breakdown is used a lot as a term, but the bar for such an event to become existential, that is, stable over billions of years, is only slightly lower. For example, the collapse of the Roman empire would not meet this bar, since we recovered from it in roughly a thousand years. Even worse collapses might be recoverable too: a quote from Anders Sandberg that comes to mind is “hunter-gatherer societies are probably unstable”. For reasons similar to those mentioned above, many AI systems that are failing or heading off in uncoordinated, random directions are extremely unlikely to result in such an unrecoverable collapse. A permanent dystopia requires lots of coordination and well-working systems (AI or human) to achieve, and therefore also seems extremely unlikely to be achieved by many AIs failing.

I think there are four specific reasons why the What Failure Looks Like threat model is highly unlikely to lead to an existential event, namely:

  1. Each AI is far from takeover capability by itself.
  2. Each AI runs only a small part of the world (likely <1%).
  3. AIs will not have exactly the same goals and will mostly not want to coordinate.
  4. AIs will go into production at different times (likely years apart).

I will defend below why I think the AIs will have these properties in both the WFLL scenarios, and why the combination of these properties makes it very unlikely that such a scenario will result in an existential event.

Each AI is far from takeover capability by itself

If many AI systems, such as those envisaged in the WFLL threat model, all had the capability to take over individually, I think the first slight misalignment (e.g. a resource conflict) would trigger a successful takeover attempt. If that many AIs have takeover capability, I think it is highly unlikely that we live long enough for the WFLL scenarios to occur. Therefore, I assume WFLL is happening in a world where the AIs are relatively far from individual takeover capability.

Each AI runs only a small part of the world (likely <1%)

Since each AI is below takeover capability, what matters more than the power of the AI itself is the power of the position in which it is employed. Of course, an AI effectively in charge of the most powerful military alliance could take over trivially. Again, therefore, in order for a WFLL-type scenario to occur, it has to be assumed that each AI only runs a small part of the world, since otherwise we won’t get this far.

AIs will not have exactly the same goals and will mostly not want to coordinate

In a world with many AIs, they will be employed by different people and institutes, for different purposes, and will therefore have different goals. One AI might be employed by an American advertising company, the next by the Chinese military. If they can work together to achieve their goals, they might choose to do so (in a similar way as humans may choose to work together), but they will often work against each other since they have different goals. If they are influence-seeking, such as Christiano writes in WFLL part 2, they will seek to increase their own influence. There is no reason, though, why AI1 would seek to increase AI2’s influence, except if it sees an advantage for itself as well. Therefore, there will not be a correlated failure, as Christiano writes, but only separate failures, at different times and in different directions. These are very unlikely to scale to an existential event, for the reasons detailed above.

Because all AIs are individually far from the capability level required for an existential event, coordination would need to be achieved between many AIs. Such an act would likely go against the interest of many other AIs, and all humans. I think a deliberate, coordinated multipolar takeover is a separate existential risk (a smaller xrisk than a Yudkowsky/Bostrom style takeover, but a larger xrisk than WFLL), and this option is discussed in more detail below.

AIs will go into production at different times (likely years apart)

Since, in WFLL, AIs reach powerful positions by getting deployed by humans, deployment will happen gradually, likely over decades. This means there is ample room for feedback: if an AI ruins one job or one part of a company or government, the next AI can be improved. If an AI acts against the interest of humans, there will be lawsuits against the company that employed it, and companies will either use an improved version, or no AI at all for this task. Governments will adopt regulations against AIs that act against human interests (this has happened in the EU already, despite competitive pressure, and is likely to increase and globalize). If a company run by an AI will use too much oxygen (such as occurs in WFLL), there will be a public outcry, lawsuits, political pressure, and plenty of time to regulate. There will be plenty of warning shots before anything goes wrong on a civilization-scale. I argue that the chance that all things required for an existential event happen at the same time, in a world where AI gets deployed gradually over years to decades, is tiny.

A coordinated multipolar takeover is different from WFLL

Going beyond WFLL, I do think there is something to be said for the idea that in a world where the percentage of intelligence that is human is ever-decreasing, the chance that humans lose control at some point is increasing. I’m thinking about this in terms similar to how the Roman empire collapsed: it gave ever more crucial defence tasks to frontier tribes with questionable loyalty. This worked for a long time, but in the end, coalitions of such tribes, led by actors that ultimately turned out not to be loyal to Rome, invaded the empire. Something similar could happen to us if we give ever more important tasks to AI.

I do think, though, that for such an event to happen and to become existential, there would need to be a point where a coordinated team (potentially consisting of both humans and AIs, but led by an AI) gets more powerful than the most powerful human-led power structure (for example NATO, or the best IT defence we can muster), and deliberately tries to take over, since it sees an advantage in doing so. I think this is conceptually quite a different threat model from WFLL.

The chance that such a deliberate, coordinated multipolar takeover would succeed may depend on:

  • How much of our society (economy, military, (social) media, politics, etc.) we cede to AI control.
  • How important (powerful) the fully automated sectors are.
  • How capable the AI is in running these sectors.
  • How easy it is for different AI-led sectors of society to coordinate. 

Trying to reduce all four bullet points should lead to decreased probability for a coordinated multipolar takeover, and therefore seems promising as a method of existential risk reduction (assuming we get this far).

I think this threat model is more likely to be existential than WFLL, but less likely than the Yudkowsky/Bostrom threat model. The main reason is that I expect AI to reach takeover capability through its raw capability, rather than through its position in society or point of application, before we apply advanced AIs to run significant parts of the world (since application in the real world often lags by years to decades).

To reduce existential risk due to WFLL, classic regulation is required, instead of alignment or a pause

For any scenario that involves many AIs and happens after deployment, I argue that classic regulation, and not alignment or a pause, is what should be used to reduce risk. Classic regulation, in this sense, means:

  1. After technology development, and
  2. At the point of application.

For both the WFLL scenarios and the coordinated multipolar takeover, the existential event would happen after training the model, after commercialization, and likely even after the first wave of responsive regulation, generated by public pressure on policymakers, has taken place.

We are used to regulating tech in a trial-and-error way: first build it, then commercialize, then see where and how certain applications can be improved by creating regulation (often after public pressure). If WFLL is one’s main concern, this fits well with this (traditional) style of regulation: after invention and at the point of application. For this threat model, it is important whether a model is employed as a regular, relatively powerless worker-equivalent (possibly little regulation needed), or as a CEO or head of state-equivalent (heavy regulation or potentially prohibition needed). The model capability, which is sub-takeover, is less important than the amount of harm a model can do at a certain point of application. Therefore, regulation can be done after training and at the point of application, similar to e.g. the EU AI Act (the tiered approach makes sense here) or the recent talks between the US and China aiming to not apply AI in nuclear weapon systems. Of course the giant advantage of regulation after technology development is that everyone knows what the technology looks like and can estimate the risks much better. That makes drafting regulation and coordinating much easier.

In my opinion, alignment is often not required to reduce risk for either WFLL or coordinated multipolar takeover. If a sub-takeover AI model is deployed at a position where it can do little harm, it doesn’t need to be aligned at all. If it misbehaves, one can simply pull the plug (alignment might increase economic AI value, but this has nothing to do with existential safety). If an AI is employed at a position where it could cause casualties, it needs to function well enough to avoid this. An example is a self-driving car: it needs to ‘act in the interest of humanity’ insofar as it does not kill people. However, to do this, the car only needs a model of driving safely, not a detailed ethical world model. This is of course very different once an AI reaches takeover capability, in which case one would need either alignment (and a pivotal act or positive offense/defence balance) or a pause.

Concluding, I hope that this post highlights that there is currently no consensus on existential threat models, and that different threat models typically require very different solutions (both technical and policy). While reducing the (possibly nonexistential) risks of WFLL can be done with traditional regulation, reducing the existential risks of a Yudkowsky/Bostrom style AI takeover should be done ahead of technology development, leading to very different risk mitigation measures. I hope this post can contribute to more structural threat model research, which I think is crucial, and to more structural and explicit coupling of risk-reducing measures (such as policy, alignment, and/or a pause) to specific threat models.

12 comments

This is a topic I'd like to see discussed more on current margins, especially for the "Get What We Measure" scenario. But I think this particular post didn't really engage with what I'd consider the central parts of a model under which Getting What We Measure leaves humanity extinct. Even if I condition on each AI being far from takeover by itself, each AI running only a small part of the world, AIs having competing goals and not coordinating well, and AIs going into production at different times... that all seems to have near-zero relevance to what I'd consider the central Getting What We Measure extinction story.

The story I'd consider central goes roughly like:

  • Species don't usually go extinct because they're intentionally hunted to extinction; they go extinct because some much more powerful species (i.e. humans) is doing big things nearby, which shifts the environment enough that the weaker species dies. That extends to human extinction from AI. (This post opens with a similar idea.)
  • In the Getting What We Measure scenario, humans gradually lose de-facto control/power (but retain the surface-level appearance of some control), until they have zero de-facto ability to prevent their own extinction as AI goes about doing other things.

Picking on one particular line from the post (just as one example where the post diverges from the "central story" above without justification):

If a company run by an AI will use too much oxygen (such as occurs in WFLL), there will be a public outcry, lawsuits, political pressure, and plenty of time to regulate.

This only applies very early on in a Getting What We Measure scenario, before humanity loses de-facto control over media and governance. Later on, matters of oxygen consumption are basically-entirely decided by negotiation between AIs with their own disparate goals, and what humans have to say on the matter is only relevant insofar as any of the AIs care at all what the humans have to say.

Great to read you agree that threat models should be discussed more, that's in fact also the biggest point of this post. I hope this strangely neglected area can be prioritized by researchers and funders.

First, I would say both deliberate hunting down and extinction as a side effect have happened. The smallpox virus is one life form that we actively didn't like and decided to eradicate, and then hunted down successfully. I would argue that human genocides are also examples of this. I agree though that extinction as a side effect has been even more common, especially for animal species. If we had a resource conflict with an animal species and it were powerful enough to actually resist a bit, we would probably start to purposefully hunt it down (for example, orangutans attacking a logger base camp - the human response would be to shoot them). So I'd argue that the closer AI (or an AI-led team) is to our capability to resist, the more likely a deliberate conflict. If ASI blows us out of the water directly, I agree that extinction as a side effect is more likely. But currently, I think AI capabilities that increase more gradually, and therefore a deliberate conflict, are more likely.

I agree that us not realizing that an AI-led team almost has takeover capability would be a scenario that could lead to an existential event. If we realize soon that this could happen, we can simply ban the use case. If we realize it just in time, there's maximum conflict, and we win (could be a traditional conflict, could also just be a giant hacking fight, or (social) media fight, or something else). If we realize it just too late, it's still maximum conflict, but we lose. If we realize it much too late, perhaps there's not even a conflict anymore (or there are isolated, hopelessly doomed human pockets of resistance that can be quickly defeated). Perhaps the last case corresponds to the WFLL scenarios?

Since there's already, according to a preliminary analysis of a recent Existential Risk Observatory survey, ~20% public awareness of AI xrisk, and I think we're still relatively far from AGI, let alone from applying AGI in powerful positions, I'm pretty positive that we will realize we're doing something stupid and ban the dangerous use case well before it happens. One hopeful example is the talks between the US and China about not letting AI control nuclear weapons. This is exactly the reason though why I think threat model consensus and raising awareness are crucial.

I still don't see WFLL as likely. But a great example could change my mind. I'd be grateful if someone could provide that.

I think you are incorrect on dangerous use cases, though I am open to your thoughts. The most obvious dangerous case right now, for example, is AI algorithmic polarization via social media. As a society we are reacting, but it doesn't seem like it is in a particularly effective way.

Another way to see this current destruction of the commons is via automated spam and search engine quality decline which is already happening, and this reduces utility to humans. This is only in the "bit" universe but it certainly affects us in the atoms universe and as AI has "atom" universe effects, I can see similar pollution being very negative for us.

Banning seems hard, even for obviously bad use cases like deepfakes, though reality might prove me wrong (happily!) there.

Thanks for engaging kindly. I'm more positive than you are about us being able to ban use cases, especially if existential risk awareness (and awareness of this particular threat model) is high. Currently, we don't ban many AI use cases (such as social algo's), since they don't threaten our existence as a species. A lot of people are of course criticizing what social media does to our society, but since we decide not to ban it, I conclude that in the end, we think its existence is net positive. But there are pocket exceptions: smartphones have recently been banned in Dutch secondary education during lecture hours, for example. To me, this is an example showing that we can ban use cases if we want to. Since human extinction is way more serious than e.g. less focus for school children, and we can ban for the latter reason, I conclude that we should be able to ban for the former reason, too. But, threat model awareness is needed first (but we'll get there).

Roko

I think this analysis is totally wrong; in WFLL humans go extinct as a result of uncoordinated actions from many different AI-based actors in the same way that humans have nearly driven chimpanzees extinct without having a master plan to do so, or have actually extincted many other animal species.

Far from 0.1% I'd say WFLL has something like a 90% chance of causing human extinction as powerful optimizers that don't care about humans and that change the world a lot will generically cause human extinction.

My largest disagreement is here:

AIs will [...] mostly not want to coordinate. ... If they can work together to achieve their goals, they might choose to do so (in a similar way as humans may choose to work together), but they will often work against each other since they have different goals.

I would describe humans as mostly wanting to coordinate. We coordinate when there are gains from trade, of course. We also coordinate because coordination is an effective strategy during training, so it gets reinforced. I expect that in a multipolar "WFLL" world, AIs will also mostly want to coordinate.

Do you expect that AIs will be worse at coordination than humans? This seems unlikely to me given that we are imagining a world where they are more intelligent than humans and humans and AIs are training AIs to be cooperative. Instead I would expect them to find trades that humans do not, including acausal trades. But even without that I see opportunities for a US advertising AI to benefit from trade with a Chinese military AI.

Thanks for engaging. I think AIs will coordinate, but only insofar as their separate, different goals are helped by it. It's not that I think AIs will be less capable at coordination per se. I'd expect that an AGI should be able to coordinate with us at least as well as we can, and coordinate with another AGI possibly better. But my point is that not all AI interests will be parallel, far from it. They will be as diverse as our interests, which are very diverse. Therefore, I think not all AIs will work together to disempower humans. If an AI or AI-led team tries to do that, many other AI-led and all human-led teams will likely resist, since they are likely more aligned with the status quo than with the AI trying to take over. That makes takeover a lot less likely, even in a world soaked with AIs. It also makes human extinction as a side effect less likely, since lots of human-led and AI-led teams will try to prevent this.

Still, I do think an AI-led takeover is a risk, as is human extinction as a side effect if AI-led teams are way more powerful. I think partial bans after development, at the point of application, are the most promising solution direction.

Both threat models involve many AIs. In both threat models, there does not seem to be a deliberate AI takeover (e.g. caused by a resource conflict), either unipolar or multipolar. Rather, the danger is, according to this model, that things are ‘breaking’, rather than ‘taking’. The existential event would be accidental, not on purpose.

What Failure Looks Like Part II is well described as "intentional AI Takeover". See also here.

I also think What Failure Looks Like Part I is better described as an "intentional" AI takeover than you seem to be implying. And technical work could effectively address this threat model. In particular, there are AIs which are well described as understanding what's going on. So if humans knew what those AIs knew, then we could likely avoid issues.

I updated a bit after reading all the comments. It seems that Christiano's threat model, or in any case the threat model of most others who interpret his writing, is about more powerful AIs than I initially thought. The AIs would already be superhuman, but for whatever reason, a takeover has not occurred yet. Also, we would apply them in many powerful positions (heads of state, CEOs, etc.).

I agree that if we end up in this scenario, all the AIs working together could potentially cause human extinction, either deliberately (as some commenters think) or as a side-effect (as others think).

I still don't think that this is likely to cause human extinction, though, mostly for the following reasons:

- I don't think these AIs would _all_ act against human interest. We would employ a CEO AI, but then also a journalist AI to criticize the CEO AI. If the CEO AI decided to let its factory consume oxygen to such an extent that humanity would suffer from it, that's a great story for the journalist AI. Then, a policymaker AI would make policy against this. More generally: I think it's a significant mistake in the WFLL threat models that the AI actions are assumed to be correlated towards human extinction. If we humans deliberately put AIs in charge of important parts of our society, they will be good at running their shop but as misaligned to each other (thereby keeping a power balance) as humans currently are. I think this power balance is crucial and may very well prevent things going very wrong. Even in a situation of distributional shift, I think the power balance is likely robust enough to prevent an outcome as bad as human extinction. Currently, some humans' job is to make sure things don't go very wrong. If we automate them, we will have AIs trying to do the same. (And since we deliberately put them in this position, they will be aligned with humans' interests, as opposed to us being aligned with chimpanzees' interests.)
- This is a very gradual process, where many steps need to be taken: AGI must be invented, trained, pass tests, be marketed, be deployed, likely face regulation, be adjusted, be deployed again. During all those steps, we have opportunities to do something about any threats that turn out to exist. This threat model can be regulated in a trial-and-error fashion, which humans are good at and our institutions accustomed to (as opposed to the Yudkowsky/Bostrom threat model).
- Given that current public existential risk awareness, according to our research, is already ~19%, and given that existential risk concern and awareness levels tend to follow tech capability, I think awareness of this threat will be near-universal before it could happen. At that moment, I think we will very likely regulate existentially dangerous use cases.

In terms of solutions:
- I still don't see how solving the technical part of the alignment problem (making an AI reliably do what anyone wants) contributes to reducing this threat model. If AI cannot reliably do what anyone wants, it will not be deployed at a powerful position, and therefore this model will not get a chance to occur. In fact, working on technical alignment will enormously increase the chance that AI will be employed at powerful positions, and will therefore increase existential risk as caused by the WFLL threat model (although, depending on pivotal act and offense/defence balance, solving alignment may decrease existential risk due to the Yudkowsky/Bostrom takeover model).
- An exception to this could be to make an AI reliably do what 'humanity wants' (using some preference aggregation method), and making it auto-adjust for shifting goals and circumstances. I can see how such work reduces this risk.
- I still think traditional policy, after technology invention and at the point of application (similar to e.g. the EU AI Act) is the most useful regulation to reduce this threat model. Specific regulation at training could be useful, but does not seem strictly required for this threat model (as opposed to in the Yudkowsky/Bostrom takeover model).
- If one wants to reduce this risk, I think increasing public awareness is crucial. High risk awareness should enormously increase public pressure to either not deploy AI at powerful positions at all, or to demand very strong, long-term, and robust alignment guarantees, either of which would reduce risk.

In terms of timing, although likely net positive, it doesn't seem to be absolutely crucial to me to work on reducing this threat model's probability right now. Once we actually have AGI, including situational awareness, long-term planning, an adaptable world model, and agentic actions (which could still take a long time), we are likely still in time to regulate use cases (again as opposed to in the Yudkowsky/Bostrom takeover model, where we need to regulate/align/pause ahead of training).

After my update, I still think the chance this threat model leads to an existential event is small and work on it is not super urgent. However, I'm now less confident in giving an upper bound risk estimate.

momom2

I think you miss one important existential risk separate from extinction, which is having a lastingly suboptimal society. Like, systematic institutional inefficiency, and being unable to change anything because of disempowerment.
In that scenario, maybe humanity is still around because one of the things we can measure and optimize for is making sure a minimum number of humans are alive, but the living conditions are undesirable.

Stretching the definition to include anything suboptimal is the most ambitious stretch I've seen so far. It would include literally everything that's wrong, or can ever be wrong, in the world. Good luck fixing that.

On a more serious note, this post is about existential risk as defined by e.g. Ord. Anything beyond that (and there's a lot!) is out of scope.

Not everything suboptimal, but suboptimal in a way that causes suffering on an astronomical scale (e.g. galactic dystopia, or dystopia that lasts for thousands of years, or dystopia with an extreme number of moral patients (e.g. uploads)).
I'm not sure what you mean by Ord, but I think it's reasonable to have a significant probability of S-risk from a Christiano-like failure.