Prerequisite reading: Cognitive Neuroscience, Arrow's Impossibility Theorem, and Coherent Extrapolated Volition.

Abstract: Arrow's impossibility theorem poses a challenge to the viability of coherent extrapolated volition (CEV) as a model for safe-AI architecture: per the theorem, no algorithm for aggregating ordinal preferences can be guaranteed to satisfy Arrow's four fairness criteria while simultaneously producing a transitive preference ordering.  One approach to exempting CEV from these consequences is to claim that human preferences are cardinal rather than ordinal, so that Arrow's theorem does not apply.  This approach is shown to ultimately fail, and other options are briefly discussed.


A problem arises when examining CEV from the perspective of welfare economics: according to Arrow's impossibility theorem, no algorithm for aggregating ordinal preferences can be guaranteed to meet four common-sense fairness criteria while simultaneously producing a transitive result.  Luke has previously discussed this challenge.  (See the post linked above.)
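For concreteness, the classic Condorcet cycle shows why transitivity is the sticking point: even the most natural ordinal aggregation rule, pairwise majority vote, can return a cyclic group preference.  The sketch below is only my illustration of that failure mode (with made-up ballots), not a rendering of Arrow's theorem itself.

    # Illustrative only: pairwise majority vote over three made-up ordinal
    # rankings produces an intransitive "group preference" (a Condorcet cycle).
    from itertools import combinations

    rankings = [           # each list is one voter's ranking, best first
        ["A", "B", "C"],
        ["B", "C", "A"],
        ["C", "A", "B"],
    ]

    def majority_prefers(x, y):
        """True if a strict majority of voters rank x above y."""
        return sum(r.index(x) < r.index(y) for r in rankings) > len(rankings) / 2

    for x, y in combinations("ABC", 2):
        winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
        print(f"{winner} beats {loser}")
    # Prints: A beats B, C beats A, B beats C -- no transitive ordering exists.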

Arrow's impossibility theorem assumes that human preferences are ordinal, but (as Luke pointed out) recent neuroscientific findings suggest that human preferences are encoded cardinally.  If so, human preferences - and consequently CEV - are not bound by the consequences of the theorem.
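As a toy illustration of the distinction (my own example, not taken from the post or the neuroscience literature): two cardinal profiles can induce the exact same ordinal ranking while carrying very different strength-of-preference information, and that extra information is precisely what an ordinal framework throws away.

    # Same ordinal ranking, very different cardinal intensities (toy numbers).
    alice = {"A": 1.00, "B": 0.99, "C": 0.0}   # nearly indifferent between A and B
    bob   = {"A": 1.00, "B": 0.01, "C": 0.0}   # strongly prefers A to B

    rank = lambda u: sorted(u, key=u.get, reverse=True)
    assert rank(alice) == rank(bob) == ["A", "B", "C"]
    # A purely ordinal procedure must treat Alice and Bob identically here,
    # even though swapping A for B costs Bob far more utility than it costs Alice.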

However, Arrow's impossibility theorem extends to cardinal utilities with the addition of a continuity axiom.  This result - termed Samuelson's conjecture - was proven by Ehud Kalai and David Schmeidler in their 1977 paper "Aggregation Procedure for Cardinal Preferences."  If an AI attempts to model human preferences using a utility theory that relies on a continuity axiom, then the consequences of Arrow's theorem still apply.  This includes, for example, an AI that models preferences with the von Neumann-Morgenstern utility theorem.
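For reference, one common statement of the continuity (Archimedean) axiom in the von Neumann-Morgenstern framework, which is the sort of axiom being pointed at here, is:

    A \succ B \succ C \;\Longrightarrow\; \exists\, p \in (0,1) \ \text{such that} \ p\,A + (1-p)\,C \sim B

Informally: no outcome is so good or so bad that it cannot be traded off against the others at some probability.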

The proof of Samuelson's conjecture limits the solution space of viable CEV aggregation procedures.  To escape the consequences of Arrow's impossibility theorem, a CEV algorithm must accurately model human preferences without relying on a continuity axiom.  It may be that we live in a second-best world where such models are impossible, in which case we must trade off between employing a fair aggregation procedure and producing a transitive result.

Supposing this is the case, what kind of trade-off would be optimal?  I am hesitant to weaken the transitivity criterion, because an agent with non-transitive preferences is vulnerable to Dutch-book arguments; that scenario poses a clear existential risk.  On the other hand, weakening the independence of irrelevant alternatives criterion may be feasible.  My cursory reading of the literature suggests that this is a popular choice among welfare economists, but there are other options.
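To make the Dutch-book worry concrete, here is a minimal money-pump sketch (my own toy example, with an invented fee): an agent whose preferences cycle A ≻ B ≻ C ≻ A will pay a little for each "upgrade" and can be walked around the cycle indefinitely.

    # Toy money pump against cyclic preferences A > B > C > A (illustrative only).
    prefers = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y) means x is preferred to y
    fee = 1.0                                        # price charged for each swap

    holding, money = "A", 0.0
    for _ in range(6):                               # walk the agent around the cycle
        for offer in "ABC":
            if (offer, holding) in prefers:          # agent strictly prefers the offer...
                holding, money = offer, money - fee  # ...so it accepts and pays the fee
                break
    print(holding, money)   # A -6.0: back where it started, strictly poorer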

Going forward, Arrow's impossibility theorem may serve as one of the strongest objections to CEV.  Further consideration of how to reconcile CEV with the theorem is warranted.

23 comments

Harsanyi's social aggregation theorem seems more relevant than Arrow.

And for anyone else who was wondering which condition in Kalai and Schmeidler's theorem fails for adding up utility functions, the answer as far as I can tell is cardinal independence of irrelevant alternatives, but the reason is unsatisfying (again as far as I can tell): namely, restricting a utility function to a subset of outcomes changes the normalization used in their definition of adding up utility functions. If you're willing to bite the bullet and work with actual utility functions rather than equivalence classes of functions, this won't matter to you, but then you have other issues (e.g. utility monsters).

Edit: I would also like to issue a general warning against taking theorems too seriously. Theorems are very delicate creatures; often if their assumptions are relaxed even slightly they totally fall apart. They aren't necessarily well-suited for reasoning about what to do in the real world (for example, I don't think the Aumann agreement theorem is all that relevant to humans).

I would also like to issue a general warning against taking theorems too seriously. Theorems are very delicate creatures; often if their assumptions are relaxed even slightly they totally fall apart.

Are the criteria for antifragility formal enough that there could be a list of antifragile theorems?

No. The fragility is in humans' ability to misinterpret theorems, not in the theorems themselves, and humans are complex enough that I highly doubt that you'd be able to come up with a useful list of criteria that could guarantee that no human would ever misinterpret a theorem.

Suppose everyone had a utility function and we just added the utility functions, ignoring scaling. What allegedly goes wrong with this? And why do I care about it?

According to the Kalai and Schmeidler paper, the problem with this is that you're only allowed to know their utility functions up to translation and scaling. In order to aggregate people's preferences deterministically, you'd have to decide beforehand on a way of assigning a utility function based on revealed preferences (such as normalizing). But, according to Kalai and Schmeidler, this is impossible unless your scheme is discontinuous or gives a different answer when you restrict to a subset of possible outcomes. (E.g., if you normalize so that an agent's least-favorite outcome has decision-theoretic utility 0, then the normalization will be different if you decide to ignore some outcomes; a toy numerical sketch follows the list below.) You probably don't care because:

  • You don't care about aggregating preferences in a deterministic way; or
  • You don't care about your aggregation being continuous; or
  • You don't care if your aggregation gives different answers when you restrict to a subset of outcomes. ("Cardinal independence of irrelevant alternatives" in the paper.)
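To see the last point concretely, here is a toy sketch (my own numbers, not from the paper) in which each voter's utilities are min-max normalized to [0, 1] over whichever outcomes are under consideration and then summed. Restricting attention from {A, B, C} to {A, B} flips the aggregate ranking of A and B even though no individual's ordinal ranking of A versus B changed; only the normalization did.

    # Toy illustration of how renormalization breaks cardinal independence of
    # irrelevant alternatives (my numbers; assumes no voter is fully indifferent).
    def aggregate(voters, options):
        """Normalize each voter's utilities to [0, 1] over `options`, then sum."""
        totals = {o: 0.0 for o in options}
        for u in voters:
            lo = min(u[o] for o in options)
            hi = max(u[o] for o in options)
            for o in options:
                totals[o] += (u[o] - lo) / (hi - lo)
        return totals

    voters = [
        {"A": 1.0, "B": 0.9, "C": 0.0},
        {"A": 1.0, "B": 0.9, "C": 0.0},
        {"A": 0.0, "B": 1.0, "C": 0.5},
    ]

    print(aggregate(voters, ["A", "B", "C"]))   # B (2.8) beats A (2.0)
    print(aggregate(voters, ["A", "B"]))        # A (2.0) beats B (1.0) after renormalizing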

EDIT: Qiaochu said it first.

Stuart Armstrong has proved some theorems showing that it's really really hard to get to the Pareto frontier unless you're adding utility functions in some sense, with the big issue being the choice of scaling factor. I'm not sure even so, on a moral level - in terms of what I actually want - that I quite buy Armstrong's theorems taken at face value, but on the other hand it's hard to see why, if you had a solution that wasn't on the Pareto frontier, agents would object to moving to the Pareto frontier so long as they didn't get shafted somehow.

It occurred to me (and I suggested to Armstrong) that I wouldn't want to trade off whole stars turned into paperclips against individual small sentients on an even basis when dividing the gains from trade, even if I came out ahead on net against the prior state of the universe before the trade. I.e., if we were executing a gainful trade and the question was how to split the spoils, and some calculation took place which ended up with the paperclip maximizer gaining a whole star's worth of paperclips from the spoils every time I gain one small-sized eudaimonic sentient, then my primate fairness calculator wants to tell the maximizer to eff off and screw the trade. I suggested to Armstrong that the critical scaling factor might revolve around equal amounts of matter affected by the trade, and you can also see how something like that might emerge if you were conducting an auction between many superintelligences (they would purchase matter affected where it was cheapest). Possibilities like this tend not to be considered in such theorems, and when you ask which axiom they violate it's often an axiom that turns out to not be super morally appealing.

Irrelevant alternatives is a common hinge on which such theorems fail when you try to do morally sensible-seeming things with them. One of the intuition pumps I use for this class of problem is to imagine an auction system in which all decision systems get to spend the same amount of money (hence no utility monsters). It is not obvious that you should morally have to pay the money only to make alternatives happen, and not to prevent alternatives that might otherwise be chosen. But then the elimination of an alternative not output by the system can, and morally should, affect how much money someone must pay to prevent it from being output.

Stuart Armstrong has proved some theorems showing that it's really really hard to get to the Pareto frontier unless you're adding utility functions in some sense, with the big issue being the choice of scaling factor.

He knows. Also, why do you say "really really hard" when the theorem says "impossible"?

It occurred to me (and I suggested to Armstrong) that I wouldn't want to trade off whole stars turned into paperclips against individual small sentients on an even basis when dividing the gains from trade, even if I came out ahead on net against the prior state of the universe before the trade.

I'm confused. How is this incompatible with maximizing a sum of utility functions with the paperclip maximizer getting a scaling factor of tiny or 0?

Can we say "being continuous with respect to the particular topology Kalai and Schmeidler chose, which is not obviously the correct topology to choose"? I would have chosen something like the quotient topology. The topology Kalai and Schmeidler chose is based on normalizations and, among other things, isolates the indifferent utility function (the one assigning the same value to all outcomes) from everything else.

Agreed, that's totally the wrong topology.

If you ignore scaling, don't you run into utility monster scenarios? And I wouldn't be so quick to say those are irrelevant in practice, since there would be a large incentive to become a utility monster, in the same way there's incentive in elections to stuff ballots if you can get away with it...

Suppose everyone had a utility function and we just added the utility functions, ignoring scaling. What allegedly goes wrong with this? And why do I care about it?

I am perplexed. Why are you asking this question? Isn't this stuff you learned about then theorized about while you were still a fetus? Is it meant rhetorically? As an exercise?

It's just the standard first question I ask for any theorem in this class. See my reply to Nisan below about some theorems by Stuart Armstrong.

I gather a group of people, brainwash them to have some arbitrary collection of values, then somehow turn them into utility monsters?

I was thinking, 'Find the person with the desires you can most easily satisfy - or whose desires allow the construction of the most easily-satisfied successor to the throne - and declare that person the utility monster by a trick of measurement.' But I don't think I quite understand the problem at hand. Eliezer sounds like he has in mind measurement-scheme criteria that would rule this out (as extrapolating the people who exist before the AI hopefully rules out brainwashing).

[anonymous]:

Supposing we can do that, there's good and bad news.

Good news: the resulting preference ordering is transitive, so the AI will function without being susceptible to Dutch book arguments. (The AI won't accept a series of bets that it is assured to lose because of cyclical preferences.)

Bad news: the aggregation procedure (adding utility functions, ignoring scaling) fails at least one of Arrow's fairness criteria. [Edit: As Qiaochu_Yuan points out, I may have misunderstood the adding procedure defined in Kalai's paper vs. the one you proposed. Under the adding procedure defined in the paper, the criterion that fails is independence of irrelevant alternatives.]

You care about this because human values are fragile, and an imperfect aggregation procedure for CEV will shatter those values. Unless overcome, Arrow's impossibility theorem ensures that any such procedure will be imperfect.

The adding procedure Eliezer describes isn't even covered by the setup of the paper you linked to. Eliezer is assuming that people have actual utility functions, whereas Kalai and Schmeidler implicitly assume that only equivalence classes of utility functions up to translation and scaling are meaningful. (The adding procedure that is well-defined in Kalai and Schmeidler's setup doesn't fail dictatorship, it fails independence of irrelevant alternatives as I pointed out in my comment.)

This is another reason not to take theorems too seriously, which is that they often have implicit assumptions (in the setup of the problem, etc.) that are easy to miss if you only look at the statement of the theorem.

[anonymous]:

Ah, oops! Please excuse my mistake.

I couldn't access the "Aggregation Procedure for Cardinal Preferences" article. In any case, why isn't using an aggregate utility function that is a linear combination of everyone's utility functions (choosing some arbitrary number for each person's weight) a way to satisfy Arrow's criteria?

It should also be noted that Arrow's impossibility theorem doesn't hold for non-deterministic decision procedures. I would also caution against calling this an "existential risk", because while decision procedures that violate Arrow's criteria might be considered imperfect in some sense, they don't necessarily cause an existential catastrophe. Worldwide range voting would not be the best way of deciding everything, but it most likely wouldn't be an existential risk.

why isn't using an aggregate utility function that is a linear combination of everyone's utility functions (choosing some arbitrary number for each person's weight) a way to satisfy Arrow's criteria?

On first inspection, it looks like "linear combination of utility functions" still has issues with strategic voting. If you prefer A to B and B to C, but A isn't the winner regardless of how you vote, it can be arranged such that you make yourself worse off by expressing a preference for A over B. Any system where you reward people for not voting their preferences can get strange in a hurry.

Let me at least formalize the "linear combination of utility functions" bit. Scale each person's utility function so that their favorite option is 1, and their least favorite is -1. Add them together, then remove the lowest-scoring option, then re-scale the utility functions to the same range over the new choice set.
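A rough Python rendering of that procedure, as I read it (assuming the elimination-and-rescaling step repeats until a single option remains, which seems to be the intent given the IRV comparison below):

    # Sketch of the procedure described above (my reading, assumptions noted).
    def eliminate_and_rescale(voters, options):
        """voters: list of dicts mapping each option to a raw utility."""
        options = list(options)
        while len(options) > 1:
            totals = {o: 0.0 for o in options}
            for u in voters:
                lo = min(u[o] for o in options)
                hi = max(u[o] for o in options)
                for o in options:
                    # Rescale this voter's utilities over the remaining options to [-1, 1].
                    totals[o] += 0.0 if hi == lo else -1.0 + 2.0 * (u[o] - lo) / (hi - lo)
            options.remove(min(options, key=totals.get))   # drop the lowest-scoring option
        return options[0]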

Arrow's Theorem doesn't say anything about strategic voting. The only reasonable non-strategic voting system I know of is random ballot (pick a random voter; they decide who wins). I'm currently trying to figure out a voting system that is based on finding the Nash equilibrium (which may be mixed) of approval voting, and this system might also be strategy-free.
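For reference, random ballot ("random dictator") is about as simple as a rule gets: your ballot only matters when you are the randomly chosen voter, in which case naming your true favorite is clearly best, which is why misreporting can never help. A minimal sketch:

    # Minimal random-ballot sketch: elect a uniformly random voter's first choice.
    import random

    def random_ballot(ballots):
        """ballots: list of rankings, each ordered from most to least preferred."""
        return random.choice(ballots)[0]

    print(random_ballot([["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]))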

When I said linear combination of utility functions, I meant that you fix the scaling factors initially and don't change them. You could make all of them 1, for example. Your voting system (described in the last paragraph) is a combination of range voting and IRV. If everyone range votes so that their favorite gets 1 and everyone else gets -1, then it's identical to IRV, and shares the same problems such as non-monotonicity. I suspect that you will also get non-monotonicity when votes aren't "favorite gets 1 and everyone else gets -1".

EDIT: I should clarify: it's not 1 for your favorite and -1 for everyone else. It's 1 for your favorite and close to -1 for everyone else, such that when your favorite is eliminated, it's 1 for your next favorite and close to -1 for everyone else after rescaling.

[anonymous]:

I apologize for not keeping up with this thread. (I've been very busy with end of the semester coursework and finals.) Now that I have the opportunity, I am reading through the comments. Very interested in knowing your thoughts.

Voted up for a clear abstract.