Crossposted with my new blog, Crossing the Rubicon, and primarily aimed at x-risk skeptics from economics backgrounds. If you're interested in novel takes on theoretical AI safety, please consider subscribing! Thanks to Basil Halperin for feedback on a draft of this post.

“So, when it comes to AGI and existential risk, it turns out as best I can ascertain, in the 20 years or so we've been talking about this seriously, there isn't a single model done. Period. Flat out. So, I don't think any idea should be dismissed. I've just been inviting those individuals to actually join the discourse of science. 'Show us your models. Let us see their assumptions and let's talk about those.'” - Tyler Cowen on EconTalk

Last May, the economist Tyler Cowen called out the AI x-risk community for its lack of formal models. The community’s response has basically been to ignore it. If the conceptual case is straightforward enough, what’s the point of modeling it? Aren’t there obviously enough degrees of freedom to make the model say anything? Besides, Cowen doesn’t seem like he’s very open to being convinced. Making a model like that would just be a waste of time. Those were my initial thoughts upon hearing the challenge, at least, and I suspect they were widely shared.

But things have changed for me since Cowen made his comments; I now have a blog! This is expressly a place for me to post things even if I don’t think they’re the best use of my research time. So I’d like to start by making the case for AI x-risk with a simple, but basically formal model. I don’t expect it to radically update anyone’s beliefs, but I’m hoping it can be a starting point for further discussions.

The Model

We’ll start with a model that should be familiar to economists: a Walrasian setup without production.

Suppose we have an economy with $n$ consumers and $m$ goods. Each consumer $i$ has a utility function over goods consumed, $u_i(x_{i1}, \dots, x_{im})$, where $x_{ij}$ is the quantity of good $j$ that consumer $i$ ends up with. These utility functions are continuous, strongly monotonic, and concave in the quantity of each good consumed. Each consumer also has an initial endowment of goods, with $e_{ij} > 0$ representing the amount of good $j$ that consumer $i$ starts with. There is a vector of prices, $p$, with $p_j$ giving the price for good $j$.

Consumers maximize their utility subject to their budget constraint:

$$\max_{x_{i1}, \dots, x_{im}} u_i(x_{i1}, \dots, x_{im}) \quad \text{subject to} \quad \sum_{j=1}^m p_j x_{ij} \le \sum_{j=1}^m p_j e_{ij}$$

We say that the market clears if, for all $j$:

$$\sum_{i=1}^n x_{ij} = \sum_{i=1}^n e_{ij}$$

A Walrasian Equilibrium consists of a vector of prices and vectors of goods for each consumer where:

  1. Each consumer is maximizing their utility given their budget
  2. The market clears

It can be shown that under the given conditions, a Walrasian Equilibrium always exists. It’s a standard result that can be found in almost any graduate level microeconomics textbook.
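As a concrete sketch of the setup, here is a minimal Python example of a two-consumer, two-good exchange economy with Cobb-Douglas utilities, which satisfy the conditions above; the preference weights and endowments are illustrative assumptions I chose, not anything from the model itself.

```python
# A toy 2-consumer, 2-good Cobb-Douglas exchange economy (all numbers are
# illustrative assumptions). With utilities
# u_i = a_i*ln(x_i1) + (1 - a_i)*ln(x_i2), the Walrasian equilibrium price
# of good 1 (taking good 2 as numeraire) has a closed form.

alphas = [0.3, 0.7]            # consumer i's utility weight on good 1
endowments = [(4.0, 1.0),      # (e_i1, e_i2): consumer i's initial endowment
              (1.0, 4.0)]

# Clear the market for good 1: sum_i a_i*(p1*e_i1 + e_i2)/p1 = sum_i e_i1,
# which solves to p1 = sum_i a_i*e_i2 / sum_i (1 - a_i)*e_i1.
p1 = sum(a * e2 for a, (_, e2) in zip(alphas, endowments)) / \
     sum((1 - a) * e1 for a, (e1, _) in zip(alphas, endowments))
prices = (p1, 1.0)

# Each consumer's utility-maximizing demand given their budget constraint.
allocation = []
for a, (e1, e2) in zip(alphas, endowments):
    wealth = prices[0] * e1 + prices[1] * e2
    allocation.append((a * wealth / prices[0], (1 - a) * wealth / prices[1]))

print("equilibrium prices:    ", prices)
print("equilibrium allocation:", allocation)

# Markets clear: total demand equals total endowment for each good.
assert abs(sum(x1 for x1, _ in allocation) - sum(e1 for e1, _ in endowments)) < 1e-9
assert abs(sum(x2 for _, x2 in allocation) - sum(e2 for _, e2 in endowments)) < 1e-9
```

With these particular numbers the script finds prices of (1.0, 1.0) and the allocation [(1.5, 3.5), (3.5, 1.5)], with both markets clearing.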

Now, let’s designate consumer 1 to be a superintelligent AI system. Before trading occurs, they have the option to kill all the other consumers and take their goods. Let $k \in \{0, 1\}$ represent their choice, with k=1 meaning they decide to do it and k=0 meaning they decide against it. The AI chooses k along with their consumption to maximize their utility.

A Walrasian Existential Risk (WXR) Equilibrium consists of a choice from the AI whether to kill everyone, a vector of prices, and vectors of goods for each consumer, where:

  1. The AI is maximizing their utility given their budget
  2. If the AI does not kill everyone then each non-AI consumer is maximizing their utility given their budget
  3. The market clears
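
To unpack the k=1 case in symbols (this is just my restatement of the definition above, in my own notation): when the AI kills everyone it holds the entire endowment, so the budget constraint and the market-clearing condition involve only the AI,

$$\sum_{j=1}^m p_j x_{1j} \le \sum_{j=1}^m p_j \left(\sum_{i=1}^n e_{ij}\right), \qquad x_{1j} = \sum_{i=1}^n e_{ij} \quad \text{for all } j,$$

while if k=0 the conditions are exactly those of the Walrasian Equilibrium defined earlier.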

Results

Theorem: There always exists a WXR Equilibrium where the AI kills everybody. No other WXR Equilibrium results in a higher utility for the AI.

A sketch of this proof is given in the appendix below, but the result will be intuitive for many.

In this model, it is always optimal for the AI to kill everybody and take their resources. The only case where it is merely tied for optimal to let humanity survive (not that we want to trust humanity’s fate to a tie-breaking procedure) is when the AI already gets everything it wants anyway. Note that “everything it wants” could include vital resources, like all of the Earth’s land and atmosphere.
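
To put rough numbers on that intuition, here is a hedged continuation of the toy Cobb-Douglas economy sketched after the existence result above (the bundles are taken from that illustration; none of this is part of the formal model): the AI’s utility from seizing the total endowment strictly exceeds its utility from trading.

```python
import math

# Continuing the toy Cobb-Douglas economy above (numbers are my illustrative
# assumptions): treat consumer 1 as the AI, with utility
# u_1(x) = 0.3*ln(x_1) + 0.7*ln(x_2).

def u_ai(x1, x2, alpha=0.3):
    return alpha * math.log(x1) + (1 - alpha) * math.log(x2)

# k = 0: the AI trades and receives its Walrasian equilibrium bundle (1.5, 3.5).
u_trade = u_ai(1.5, 3.5)

# k = 1: the AI seizes and consumes the total endowment (5, 5).
u_seize = u_ai(5.0, 5.0)

print(f"u(trade) = {u_trade:.3f}")   # ~0.999
print(f"u(seize) = {u_seize:.3f}")   # ~1.609
assert u_seize > u_trade             # seizing strictly dominates in this example
```

In this example the AI is not sated by what it can get through trade, so k=1 is a strict improvement rather than a tie.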

It feels almost like cheating just to add a term for whether the AI chooses to violently seize control. Of course killing everybody and taking their stuff is optimal behavior for most utility functions. The only reason agents in most economic models don’t do that is because they can’t according to the model. However, I still think it’s important to establish the result as a baseline. Making a simple change to a simple model that gives an AI the power to kill everyone results in that power being used.

Is this model pointing out a subtle mechanism that AI skeptics have thus far missed? Obviously not. Rather, the purpose of the model is to hammer on a straightforward mechanism that’s nearly universally present. The reason the AI x-risk community doesn’t bother building models to show that AI will have an incentive to kill everybody and take their stuff is that virtually any such model shows exactly that.

That said, I also think it’s important to establish this with a concrete model because there are in fact many people weighing in on AI who would disagree that killing everybody and taking their stuff is optimal behavior for most utility functions. In many cases, I suspect their fundamental disagreement lies elsewhere, and their rejection of that statement comes more from a mental shortcut that says arguments for x-risk from AI are overstated. This model puts pressure on them to clarify their real disagreement, which can then be discussed more productively.

Objections

So, what disagreements might someone have with this model and with the case for AI x-risk more broadly? I can think of a few that are considerably stronger than trying to argue that unilaterally acquiring all resources is not optimal for most utility functions.

Objection 1:

“Sure, killing everybody and taking their stuff may be optimal behavior for a wide variety of utility functions, but in practice AI won’t be able to kill everybody.”

This is a very reasonable objection – the model doesn’t say anything about whether AI will in fact be able to kill everyone. Personally, I’m very worried that an AI’s ability to run many copies of itself will allow it to develop a massive technological edge over humanity, and I’m pretty sure I can model that formally, but that’s a task for another post. For those interested in a formal model, Natural Selection of Artificial Intelligence is a recent paper showing how unaligned AIs that favor self-replication accrue power. Intelligence Explosion Microeconomics is an older paper that explores, albeit somewhat less formally, how an AI could quickly increase in power.

Economists in particular are often predisposed to the argument that powerful AIs will be kept in check by other AIs, since that’s roughly how humans keep other humans in check (though note that we still ruthlessly exploit livestock). Even if AIs could collectively kill all humans, the argument goes, their individual incentives will prevent cooperation to that end. I think designing and implementing such incentives is an interesting research direction, but it’s unlikely to happen by default. If our AIs each individually want to kill us, we should expect that they’ll find a way to coordinate and split the spoils. This is particularly true if our current world persists and there are only a small number of leading AI labs and cutting edge models.

Ultimately, if your objection to the case for AI x-risk is that AIs will lack the means to kill us, I’d consider that a major step forward. It suggests that we should put in place strong safeguards to ensure that inability persists, including limits on how powerful or widely deployed models can get. If you’re averse to such safeguards because they would reduce the potential benefits of AI, that reinforces the urgency of developing a robust alignment solution so that we can go full speed ahead with capabilities.

Objection 2:

“Sure, killing everybody and taking their stuff may be optimal behavior for a wide variety of utility functions, but we’re not going to randomly select a utility function. We’ll deliberately choose one for which killing everybody and taking their stuff is sufficiently discouraged.”

If that’s your main objection to the model, then I say welcome to the AI alignment community! A concern that most goals lead to AI wanting to kill everyone and a desire to find one that doesn’t is basically our defining feature. 

Where you probably differ from many of us is in how easy you think it is to choose a safe utility function that still remains useful. My concern in this area is that we currently don’t have a single specification for such a utility function, and even if we did we wouldn’t know how to implement it. While there is hope that deep learning can find a utility function with properties we can’t specify, that’s putting a lot of hope in a generalization process we don’t understand.

You might also look at the model above and point out that the AI doesn’t always kill everyone. If it gets everything it wants regardless, it’s indifferent to doing so. Why don’t we give the AI a utility function it can satiate without needing to resort to violence? I’m actually quite optimistic about this approach, but there is a thorny issue to resolve. Limited AIs will be outcompeted by AIs without such constraints, both in their ability to influence the world and in which models people choose to deploy.
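
As one hedged illustration of what a satiable objective could look like (my notation, not a proposal from the post): cap the AI’s value for each good at a target level $\bar{x}_j$,

$$u_1(x_{11}, \dots, x_{1m}) = \sum_{j=1}^m \min\{x_{1j}, \bar{x}_j\}.$$

If the AI’s own endowment already meets every target ($e_{1j} \ge \bar{x}_j$ for all $j$), seizing everyone else’s goods adds nothing, so k=1 is at most a tie rather than a strict improvement. Note that this drops the strong monotonicity assumed in the setup, and it still leaves the tie-breaking worry mentioned in the Results section.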

If you believe that it will be easy to give AI safe goals, then there exists a potential compromise with those who think it will be more difficult. Requiring AI labs to show that the goal they’re giving to an AI agent is safe to optimize for imposes little burden in a world where such goals are easy to find, and only poses a barrier if finding them is in fact challenging.

Objection 3: 

“Sure, killing everybody and taking their stuff may be optimal behavior for a wide variety of utility functions, but current state-of-the-art LLMs are not well described as optimizing a utility function. Since this seems like the most likely path to AGI, the dangers of optimization are less applicable.”

This is an objection I see less from people who are skeptical of the whole AI safety field, and more from some people within it who are more optimistic. I’m in agreement that current models are not best thought of as optimizing a utility function, but deeply skeptical that will continue to hold for powerful future models. There is a major performance advantage to agentic models that search through many possible actions, evaluating potential outcomes against a goal. AI labs have the incentive to create agents, and even just training models for better performance could lead to agents being created unintentionally. 

If this is your main objection, then the amount of work directed at explicitly making AI agents should be terrifying. Labs need to voluntarily make commitments not to create agents until we know that we can align them, and governments should formalize that requirement legally. An academic announcing a paper on AI agents should be met with scandalized murmurs rather than congratulations. The current state of non-agentic AI is fragile, and we have no idea how much longer it will last.

Other Objections:

You might also have issues with what I put into the model, rather than what I left out. I made a lot of simplifying assumptions, and if you want to argue how one of those is responsible for the results, I’m all ears. If you want to propose a different model that leads to different outcomes, even better! 

I don’t intend for this model to settle a debate; rather, I hope it will start one. To those who argue that existential risk from AI is minimal, I say: show me your models. Let us see the assumptions and let's talk about those. The ball is in your court now.

Appendix:

Theorem: There always exists a WXR Equilibrium with k=1. No other WXR Equilibrium results in a higher utility for the AI.

Proof Sketch: There is an equilibrium where

  1. k=1
  2. The AI consumes the total endowment of every good, and every other consumer consumes nothing
  3. Prices are chosen so that the total-endowment bundle is optimal for the AI given its budget

Here, the AI is maximizing since it consumes everything available, and markets clear because the prices ensure the AI does not wish to consume more or less than what is available. No other WXR equilibrium results in a higher utility for the AI: if k=1 then the consumption bundle is the same, while if k=0 then the consumption bundle does not contain more of any good and the utility function is non-decreasing.
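
For completeness, here is one explicit choice of supporting prices consistent with the sketch above, assuming $u_1$ is differentiable (with concavity alone, any supergradient works instead). Writing $\bar{e}_j = \sum_{i=1}^n e_{ij}$ for the total endowment of good $j$, set

$$p_j = \frac{\partial u_1}{\partial x_{1j}}\bigl(\bar{e}_1, \dots, \bar{e}_m\bigr).$$

At these prices the AI’s first-order conditions hold at $x_1 = \bar{e}$ with the budget binding, so consuming exactly the total endowment is optimal for the AI and the market clears; strong monotonicity keeps the prices strictly positive.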

Comments:

I think the other productive route is through the Dutch book argument: if your beliefs violate the probability axioms, you will accept sure-loss bets. Humans and superintelligences both stray from probability theory, but superintelligences can notice when we are making bad bets, while the reverse is not true. Therefore, either superintelligences will be able to turn us into money pumps, or we will refuse to trade at some point and the only profitable option for superintelligences will be to take our stuff.

“My concern in this area is that we currently don’t have a single specification for such a utility function”

I would claim that both Value Learning and Requirements for a Basin of Attraction to Alignment are outlines for how to create such a utility function. But I agree we don't have a detailed specification yet.

I like the simple and clear model and I think discussions about AI risk are vastly improved by people proposing models like this.

I would like to see this model extended by including the productive capacity of the other agents in the AI's utility function. In other words, the other agents have a comparative advantage over the AI in producing some stuff and the AI may be able to get a higher-utility bundle overall by not killing everyone (or even increasing the productivity of the other agents so they can produce more stuff for the AI to consume).