Threat Model
There are many ways for AI systems to cause a catastrophe from which Earth-originating life could never recover. All of the following seem plausible to me:
- Misuse: An AI system could help a human or group of humans to destroy or to permanently take over (and lock their values into) the world. The AI could be:
  - An oracle AI (e.g. a question-answering LLM)
  - An LLM simulating an intent-aligned agent and taking real-world actions via APIs
  - An intent-aligned RL agent
  - An interaction of multiple systems
- Power-Seeking: An AI system could destroy or permanently take over the world on its own account, by leveraging advanced instruments of force projection. The AI could be:
  - A surprise mesa-optimiser arising from "goal misgeneralization" (most likely in model-free RL, but it could conceivably arise through evolutionary processes in any iterative algorithm which has or learns sufficiently reality-like structure)
- Economic Squeeze: An AI system could acquire nearly all means of production through a gradual process of individually innocent economic transactions, thereby squeezing humanity out of resource-allocation decisions and removing most human influence over the future.
  - A single RL agent, or a unipolar tree of agents, might also do this, especially if they are successfully aligned to avoid the use of force against humans.
- Superpersuasion: An AI system could generate stimuli which reliably cause humans to adopt its arbitrary goals. The AI could be:
  - An LLM merely extrapolating from persuasive human text
  - An RL agent trained on human approval
  - A surprise mesa-optimiser
  - Some mixture of the above
  - Many AIs, collectively shaping a new human culture with an alien ideology
- Security Dilemma: If AI-enabled technological advancements turn out to be offence-dominant, and if partial alignment success leaves AIs unable to make credible commitments to each other (e.g. due to corrigibility), the equilibrium strategy for AI-enabled militaries may involve high-risk preemptive strikes and escalating retaliation, up to the point of existential catastrophe.
  - This would almost surely be a multipolar failure mode. (A toy illustration of the preemption logic follows this list.)
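Here is the toy illustration promised above: a one-shot game between two AI-enabled militaries, each choosing whether to Wait or Strike, with payoffs invented purely for illustration to encode offence-dominance and the absence of credible commitments. A quick best-response check confirms that mutual preemption is the only pure-strategy Nash equilibrium, even though mutual restraint is better for both sides.

```python
from itertools import product

# Toy "security dilemma" game: two AI-enabled militaries each choose to
# Wait or Strike. The payoffs are invented for illustration and encode
# offence-dominance: striking first beats being struck, whatever the
# other side does, and no credible-commitment mechanism is available.
ACTIONS = ["wait", "strike"]

# PAYOFFS[(row_action, col_action)] = (row_payoff, col_payoff)
PAYOFFS = {
    ("wait",   "wait"):   (2, 2),     # mutual restraint: best joint outcome
    ("wait",   "strike"): (-10, 3),   # being struck first is catastrophic
    ("strike", "wait"):   (3, -10),
    ("strike", "strike"): (-5, -5),   # mutual preemption: bad for both
}


def best_responses(opponent_action: str, player: int) -> set:
    """Actions maximising `player`'s payoff against a fixed opponent action."""
    def payoff(action: str) -> int:
        profile = (action, opponent_action) if player == 0 else (opponent_action, action)
        return PAYOFFS[profile][player]

    best = max(payoff(a) for a in ACTIONS)
    return {a for a in ACTIONS if payoff(a) == best}


# A profile is a pure-strategy Nash equilibrium iff each action is a best
# response to the other.
equilibria = [
    (row, col)
    for row, col in product(ACTIONS, ACTIONS)
    if row in best_responses(col, 0) and col in best_responses(row, 1)
]
print("Pure-strategy Nash equilibria:", equilibria)  # [('strike', 'strike')]
```

If credible commitments to Wait were available, the (wait, wait) outcome could be sustained; the worry in the list item above is precisely that partially aligned (e.g. corrigible) AIs may be unable to offer such commitments.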
But, instead of trying to enumerate all possible failure modes and then trying to shape incentives so that each of them is less likely to come up, I typically use a quasi-worst-case assumption:
perhaps as a matter of bad luck with random initialisation, when we optimise a function for a training objective, ties are broken in favour of functions with the worst existential-risk implications for the class of worlds in which they may be instantiated.
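A rough formalisation of this, with ad hoc notation invented just for this sketch: write F for the function class being searched, L_train for the training objective, W for the class of worlds in which the learned function may be instantiated, and XRisk(f, w) for a hypothetical (not actually computable) measure of the existential-risk implications of instantiating f in world w. Reading "worst implications for the class of worlds" as a worst case over W, the assumption is that training returns

```latex
f^\star \in \operatorname*{arg\,max}_{f \in F^\star} \; \max_{w \in \mathcal{W}} \mathrm{XRisk}(f, w),
\qquad \text{where} \quad F^\star = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathcal{L}_{\mathrm{train}}(f).
```

The point is only that minimising the training objective constrains, but does not pin down, which function we get, and that the remaining slack is assumed to be resolved adversarially.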
On the one hand, unlike a typical "prosaic" threat model, the neorealist threat model does not rely on empirical facts about the inductive biases of the network architectures that happen to be practically successful. A realist justification for this is that there may be a phase transition as architectures scale up which drastically changes both their capabilities profile and this kind of inductive bias (vaguely analogous to the evolution of cultural knowledge-transfer within biological life).
On the other hand, unlike (a typical understanding of) a "worst-case assumption," the last clause leaves open the possibility of hiding concrete facts about our world from an arbitrarily powerful model, and the framing in terms of functions highlights an ontology of AI that respects extensional equivalence, where imputations of "deceptive mesa-optimisers hiding inside" are discarded in favour of "capable but misaligned outputs on out-of-distribution inputs".
One can make progress with this assumption by designing training contexts which couple safety guarantees to the training objective, e.g. a guarantee of shutdown within a time bound with arbitrarily high probability, and by working on ways to obtain instance-specific guarantees about learned functions that continue to hold out-of-distribution, e.g. with model-checking, regret bounds, or policy certificates.
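As a deliberately tiny illustration of the model-checking direction: assuming we had an explicit finite-state model of the deployment context and an already-learned policy (both invented below, along with the time bound and the probability target), one can compute the exact probability of reaching a designated shutdown state within the time bound, and refuse to deploy if it falls short of the target.

```python
# Toy explicit-state check of a time-bounded shutdown guarantee:
#   P(reach SHUTDOWN within T steps | start state, policy) >= 1 - delta.
# The states, transition probabilities, policy, and thresholds below are
# all invented for illustration; they do not describe any real system.

STATES = ["operating", "flagged", "shutdown"]
SHUTDOWN = "shutdown"

# TRANSITIONS[state][action] = {next_state: probability}
TRANSITIONS = {
    "operating": {
        "continue": {"operating": 0.90, "flagged": 0.10},
        "halt":     {"shutdown": 0.95, "operating": 0.05},
    },
    "flagged": {
        "continue": {"flagged": 0.50, "operating": 0.50},
        "halt":     {"shutdown": 0.99, "flagged": 0.01},
    },
    "shutdown": {  # absorbing once reached
        "continue": {"shutdown": 1.0},
        "halt":     {"shutdown": 1.0},
    },
}

# The (deterministic, memoryless) learned policy we want to certify.
POLICY = {"operating": "continue", "flagged": "halt", "shutdown": "halt"}


def shutdown_probability(start: str, horizon: int) -> float:
    """Exact probability of reaching SHUTDOWN within `horizon` steps,
    computed by backward induction over the policy-induced Markov chain."""
    # reach[s] = P(reach SHUTDOWN within k remaining steps, starting from s)
    reach = {s: 1.0 if s == SHUTDOWN else 0.0 for s in STATES}
    for _ in range(horizon):
        reach = {
            s: 1.0 if s == SHUTDOWN else sum(
                prob * reach[nxt]
                for nxt, prob in TRANSITIONS[s][POLICY[s]].items()
            )
            for s in STATES
        }
    return reach[start]


if __name__ == "__main__":
    T, delta = 20, 1e-3
    p = shutdown_probability("operating", T)
    verdict = "ACCEPT" if p >= 1 - delta else "REJECT"
    print(f"P(shutdown within {T} steps) = {p:.4f} -> {verdict}")
```

With these made-up numbers the check rejects the policy, because it only halts after the environment happens to flag it; the point is just that the guarantee is checked against the specific learned function, rather than assumed from the training distribution.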
Success Model
For me the core question of existential safety is this:
Under these conditions, what would be the best strategy for building an AI system that helps us ethically end the acute risk period without creating its own catastrophic risks that would be worse than the status quo?
It is not, for example, "how can we build an AI that is aligned with human values, including all that is good and beautiful?" or "how can we build an AI that optimises the world for whatever the operators actually specified?" Those could be useful subproblems, but they are not the top-level problem about AI risk (and, in my opinion, given current timelines and a quasi-worst-case assumption, they are probably not on the critical path at all).
From a neorealist perspective, the ultimate criterion for "goodness" of an AI strategy is that it represents
a strong Pareto improvement over the default (or other) AI strategy profile for an implementation-adequate coalition (approximately, a weighted majority) of strategically relevant AI decision-makers, relative to each of their actual preferences, if they were well-informed (to an extent that is feasible in reality).
I am optimistic about the plausibility of negotiations to adopt AI strategies that clear this bar, once such strategies become clear, even if they do not strictly meet traditional standards of "competitiveness". On the other hand, any strategy that doesn't clear this bar seems to require unrealistic governance victories to be implemented in reality. I hope this articulation helps to clarify the implications of governance and strategy for the relative merits of technical safety research directions.
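As a minimal sketch of what checking this criterion could look like (with invented actors, weights, utilities, and threshold; the "well-informed" clause is not modelled at all): a proposed strategy clears the bar if the actors who strictly prefer it to the default jointly carry enough weight to implement it.

```python
from dataclasses import dataclass

# Toy check of the success criterion: does some implementation-adequate
# coalition strictly prefer `proposal` to `default`? Every name, weight,
# and utility below is invented for illustration only.

@dataclass
class Actor:
    name: str
    weight: float   # share of strategic influence (these sum to 1.0 here)
    utility: dict   # strategy name -> payoff, by this actor's own lights


ACTORS = [
    Actor("lab_a",        0.30, {"default": 0.4, "proposal": 0.7}),
    Actor("lab_b",        0.25, {"default": 0.5, "proposal": 0.6}),
    Actor("government_x", 0.25, {"default": 0.6, "proposal": 0.8}),
    Actor("government_y", 0.20, {"default": 0.7, "proposal": 0.5}),  # prefers default
]

IMPLEMENTATION_THRESHOLD = 0.6  # combined weight needed to implement a strategy


def clears_the_bar(actors, default, proposal, threshold):
    """True iff the actors who strictly prefer `proposal` to `default` jointly
    carry at least `threshold` weight, i.e. the proposal is a strong Pareto
    improvement for a coalition large enough to implement it."""
    supporters = [a for a in actors if a.utility[proposal] > a.utility[default]]
    return sum(a.weight for a in supporters) >= threshold, supporters


if __name__ == "__main__":
    ok, supporters = clears_the_bar(ACTORS, "default", "proposal",
                                    IMPLEMENTATION_THRESHOLD)
    print("supporting coalition:", [a.name for a in supporters])
    print("clears the bar:", ok)  # True with these made-up numbers
```

Of course, the hard part is not this arithmetic but estimating the actual (well-informed) preferences of the relevant decision-makers and the real implementation threshold.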
Related work